L. Junbum @__beomi__ - Twitter Profile

Pinned Tweet

about 2 years ago

#Llama3 #Llama3Ko #Llama3OpenKo I’ve just uploaded the first version of the Llama-3-Ko series, the Llama-3-Open-Ko-8B model, to Hugging Face! You can check it out here: https://t.co/01r6TRVKAP.

3

15

6

4

2K

__Beomi__ retweeted

Haozhe Jiang @erichzjiang

1 day ago

Why aren’t Diffusion Language Model smart yet? Lacking stable post training is a major bottleneck! Meet DiPOD: the tripod for diffusion model post-training. DiPOD boosts accuracy across reasoning tasks, with Sudoku jumping from 22% to 97%, through a one-line code change. 🧵1/5

6

349

48

311

33K

__Beomi__ retweeted

jietang

@jietang

1 day ago

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include: Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20% Pure Open: An MIT open-source license — no regional limits, technical access without borders Supporting long-horizon tasks starts with making long context engineering-usable: the model must maintain quality across long, messy coding-agent trajectories, not just accept more tokens. A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure. To this end, we substantially expanded 1M-context training for coding-agent scenarios, covering large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is not only wide in scope, but solid in execution: a practical substrate for sustained engineering work. This capability is reflected in GLM-5.2's performance on three long-horizon coding benchmarks. FrontierSWE measures whether an agent can complete open-ended technical projects at the scale of hours to tens of hours, spanning systems optimization, large-scale code construction, and applied ML research. On this benchmark, GLM-5.2 trails Opus 4.8 by only 1%, while edging out GPT-5.5 by 1% and Opus 4.7 by 11%. On PostTrainBench, where each agent is given an H100 GPU and evaluated by how much it can improve small models through post-training, GLM-5.2 outperforms both Opus 4.7 and GPT-5.5, ranking second only to Opus 4.8. On SWE-Marathon, an ultra-long-horizon software engineering benchmark covering tasks such as building compilers, optimizing kernels, and developing production-grade services, GLM-5.2 still has room to grow, trailing Opus 4.8 by 13% while remaining second only to the Opus series. Across all three benchmarks, GLM-5.2 is the highest-ranked open-source model, showing that its 1M context has translated into practical long-horizon delivery capability.

jietang's tweet photo. We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include:

Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work
Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency
Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20%
Pure Open: An MIT open-source license — no regional limits, technical access without borders
Supporting long-horizon tasks starts with making long context engineering-usable: the model must maintain quality across long, messy coding-agent trajectories, not just accept more tokens. A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure. To this end, we substantially expanded 1M-context training for coding-agent scenarios, covering large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is not only wide in scope, but solid in execution: a practical substrate for sustained engineering work.

This capability is reflected in GLM-5.2's performance on three long-horizon coding benchmarks. FrontierSWE measures whether an agent can complete open-ended technical projects at the scale of hours to tens of hours, spanning systems optimization, large-scale code construction, and applied ML research. On this benchmark, GLM-5.2 trails Opus 4.8 by only 1%, while edging out GPT-5.5 by 1% and Opus 4.7 by 11%. On PostTrainBench, where each agent is given an H100 GPU and evaluated by how much it can improve small models through post-training, GLM-5.2 outperforms both Opus 4.7 and GPT-5.5, ranking second only to Opus 4.8. On SWE-Marathon, an ultra-long-horizon software engineering benchmark covering tasks such as building compilers, optimizing kernels, and developing production-grade services, GLM-5.2 still has room to grow, trailing Opus 4.8 by 13% while remaining second only to the Opus series. Across all three benchmarks, GLM-5.2 is the highest-ranked open-source model, showing that its 1M context has translated into practical long-horizon delivery capability.

144

3K

244

364

248K

__Beomi__ retweeted

Z.ai @Zai_org

1 day ago

Introducing GLM-5.2: Frontier Intelligence, Open Weights - Significant improvements in coding and agentic tasks - Strong long-horizon capabilities with a 1M context window - Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency - MIT-licensed open weights - Same API pricing as GLM-5.1 Tech Blog: https://t.co/LAsxUdN0JZ Weights: https://t.co/g0A1C4UWx4 API: https://t.co/Kc3E22cbN7 Coding Plan: https://t.co/Nk8Y98HNhU Chat: https://t.co/WCqWT0qCQb

Zai_org's tweet photo. Introducing GLM-5.2: Frontier Intelligence, Open Weights

- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1

Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb

498

9K

1K

3K

4M

Who to follow

Security Researcher | Low Level Systems

__Beomi__ retweeted

RyanLee

@RyanLeeMiniMax

7 days ago

Hey everyone — our high-performance MSA kernel library is now open-source. The M3 weights are expected to drop this Friday. Thanks for waiting! Github: https://t.co/7hixC7FNg7 Paper：https://t.co/t1nHSJgGwB

RyanLeeMiniMax's tweet photo. Hey everyone — our high-performance MSA kernel library is now open-source. The M3 weights are expected to drop this Friday. Thanks for waiting!
Github: https://t.co/7hixC7FNg7
Paper：https://t.co/t1nHSJgGwB https://t.co/7Q5K13MG41

26

971

108

465

115K

__Beomi__ retweeted

Jeff Dean

@JeffDean

6 days ago

Quite interesting thread on capabilities of real biological neurons (spoiler: they're way more capable than classical artificial neurons in a perceptron) . Nice work @IdoAizenbud and collaborators!

28

672

83

417

97K

__Beomi__ retweeted

MiniMax (official) @MiniMax_AI

6 days ago

MiniMax M3, Open-Weight, Now On Hugging Face , with only ~428B parameters and ~23B activated parameters Weights: https://t.co/g4Ybfa2kWH MiniMax Sparse Attention: https://t.co/HcTlWRotG3

113

3K

329

540

679K

__Beomi__ retweeted

Kimi Developers

@KimiDevs

6 days ago

Meet Kimi-K2.7-Code 👀 Here’s what developers should know to fully unlock K2.7-Code potential：

78

3K

174

348

220K

__Beomi__ retweeted

elie

@eliebakouch

7 days ago

early results on how to use tool call result to make agent model the world amazing work (and blog) by @omouamoua https://t.co/6Xb4Nzov9F

eliebakouch's tweet photo. early results on how to use tool call result to make agent model the world

amazing work (and blog) by @omouamoua

https://t.co/6Xb4Nzov9F https://t.co/TPBnEcooao

5

254

24

163

22K

__Beomi__ retweeted

Google Gemma

@googlegemma

8 days ago

Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

169

5K

811

2K

932K

__Beomi__ retweeted

elie

@eliebakouch

about 1 month ago

this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more could even see it being used as an eval to quantify how smart a model is, i love this

eliebakouch's tweet photo. this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more

could even see it being used as an eval to quantify how smart a model is, i love this

21

1K

109

873

111K

__Beomi__ retweeted

Adithya S K

@adithya_s_k

about 1 month ago

When people ask me what reward hacking in RL is > here’s the simplest example I personally saw and share to explain it in 1 min: Last year, when GRPO first came out and TRL had an implemented it, people were fine-tuning very small model such as Qwen 0.5B/llama 1b,3b on the GSM8K dataset to make these models better at solving math problems. I took a slightly different approach. Instead of asking the LLM directly to solve math problems, I asked it to generate Python code to solve them and print the outputs. I was experimenting with different rewards and in one of the run I gave the usual format and correctness rewards, but also added a code execution reward, giving it a between 0 to 2 reward if the code executed without errors and few other conditions I was training on Qwen 0.5B, and things were doing okay. The format reward improved, and correctness was on and off. But suddenly, at some point, the code execution reward shot up and stayed high. When I looked at the traces, I saw that it had started adding try-except blocks so the code would execute without throwing errors. This is a simple example of reward hacking: the model learned to maximize the reward without actually solving the task correctly. ps: this experiment was a year ago, and have mentioned to couple of folks offline as well, there is a lot more to reward hacking than this and this example is over simplification but easy to understand and relate to hope it helps

adithya_s_k's tweet photo. When people ask me what reward hacking in RL is

> here’s the simplest example I personally saw and share to explain it in 1 min:

Last year, when GRPO first came out and TRL had an implemented it, people were fine-tuning very small model such as Qwen 0.5B/llama 1b,3b on the GSM8K dataset to make these models better at solving math problems.

I took a slightly different approach. Instead of asking the LLM directly to solve math problems, I asked it to generate Python code to solve them and print the outputs.

I was experimenting with different rewards and in one of the run I gave the usual format and correctness rewards, but also added a code execution reward, giving it a between 0 to 2 reward if the code executed without errors and few other conditions

I was training on Qwen 0.5B, and things were doing okay. The format reward improved, and correctness was on and off. But suddenly, at some point, the code execution reward shot up and stayed high.

When I looked at the traces, I saw that it had started adding try-except blocks so the code would execute without throwing errors.

This is a simple example of reward hacking: the model learned to maximize the reward without actually solving the task correctly.

ps: this experiment was a year ago, and have mentioned to couple of folks offline as well, there is a lot more to reward hacking than this and this example is over simplification but easy to understand and relate to hope it helps

11

391

17

243

69K

__Beomi__ retweeted

AK

@_akhaliq

about 1 month ago

Continuous Latent Diffusion Language Model paper: https://t.co/qgUyX7D1Mr

4

328

58

226

73K

__Beomi__ retweeted

hardmaru

@hardmaru

about 1 month ago

The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math. We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens. Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements. Paper: https://t.co/rqIY9SYBDe Blog: https://t.co/oRjNbpJKha Code: https://t.co/FAFaJwpxAJ ⚡️

hardmaru's tweet photo. The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.

Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements.

Paper: https://t.co/rqIY9SYBDe
Blog: https://t.co/oRjNbpJKha
Code: https://t.co/FAFaJwpxAJ
⚡️

52

3K

509

3K

432K

__Beomi__ retweeted

机器之心 JIQIZHIXIN

@jiqizhixin

about 2 months ago

Attention to Mamba: A Recipe for Cross-Architecture Distillation Paper: https://t.co/28ru53QO16

0

14

2

10

953

__Beomi__ retweeted

Google Research

@GoogleResearch

about 2 months ago

Google presents a new Transformer alternative at #ICLR2026! Join Nino Scherrer & Yanick Schimpf at the Google booth (#411) at 10AM to learn about MesaNet, proposing a new linear sequence layer that optimally learns in-context given a fixed memory budget.

GoogleResearch's tweet photo. Google presents a new Transformer alternative at #ICLR2026! Join Nino Scherrer & Yanick Schimpf at the Google booth (#411) at 10AM to learn about MesaNet, proposing a new linear sequence layer that optimally learns in-context given a fixed memory budget. https://t.co/OFo0Qtb1KI

37

957

130

543

61K

__Beomi__ retweeted

Omar Sanseviero

@osanseviero

about 2 months ago

Introducing Decoupled DiLoCo, a breakthrough in large scale distributed training Low bandwidth way of training globally in a distributed setup. DiLoCo and follow-up works can be quite game changing in the industry and I'm very excited to see how they evolve

osanseviero's tweet photo. Introducing Decoupled DiLoCo, a breakthrough in large scale distributed training

Low bandwidth way of training globally in a distributed setup. DiLoCo and follow-up works can be quite game changing in the industry and I'm very excited to see how they evolve https://t.co/E7Xi6AEgTa

4

141

13

27

8K

__Beomi__ retweeted

Jeff Dean

@JeffDean

about 2 months ago

It's been a delight to provide small amounts of advice and suggestions to people working on the Decoupled DiLoCo training system. This approach enables graceful handling of failures in large scale training jobs, by allowing (N-1) / N units to proceed when one fails. Thread ⬇️

23

686

75

243

109K

__Beomi__ retweeted

Luke Bailey

@LukeBailey181

about 2 months ago

Self-play led to superhuman Go performance, why hasn’t it for LLMs? In practice, long run self-play plateaus like RL. We study why this happens, and build a self-play algorithm that scales better. It solves as many problems with a 7B model as the pass@4 of a model 100x bigger.

31

1K

152

876

146K

__Beomi__ retweeted

OpenAI Developers

@OpenAIDevs

about 2 months ago

GPT-5.5 is here. It’s our smartest frontier model yet, introducing a new class of intelligence for agentic coding, computer use, knowledge work, and scientific research. Rolling out in ChatGPT and Codex today. API is coming soon.

231

6K

566

680

600K

__Beomi__ retweeted

Kimi.ai @Kimi_Moonshot

about 2 months ago

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: https://t.co/sf4UohXDWY

45

2K

184

618

214K

L. Junbum

@Beomi

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users

L. Junbum

@__Beomi__

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users

@Beomi