Rui Lu

@RayLu_THU

PhD student at @Tsinghua_Uni studying machine learning theory; graduate of the Yao Class. Also a YouTuber @ 漫士沉思录 manshi_math

Joined October 2022

183 Following

371 Followers

44 Posts

RayLu_THU retweeted

Shenzhi Wang🌟

@ShenzhiWang_THU

10 days ago

🎉The Flexibility Trap has been accepted as an ICML 2026 Oral paper (168 out of 23,198)! Huge thanks to my coauthors, especially @ZanlinNi1 and @YangYue_THU.

In this work, we bring the token-entropy-based CoT analysis from Beyond the 80/20 Rule into the dLLM setting.

We find that “arbitrary-order generation,” long viewed as a key advantage of dLLMs, does not necessarily expand the solution space on general reasoning tasks such as math and code. Instead, it may allow the model to skip over high-entropy yet crucial logical branching points, effectively “locking” its reasoning potential.

Based on this finding, we propose a simple method, JustGRPO: train the model in an AR manner, and then enable parallel decoding at inference time. This improves performance while preserving the high decoding speed of dLLMs.

Recently, some friends mentioned that the “keep only the top 20% high-entropy tokens” method proposed in Beyond the 80/20 Rule may not be very robust in some RL settings. My view is that Beyond the 80/20 Rule was trying to highlight an important principle at a time when most analyses focused on overall entropy loss:

When analyzing CoT, we should also take a token-level entropy perspective and treat high- and low-entropy tokens differently. High-entropy tokens tend to determine the direction of reasoning, while low-entropy tokens help complete the reasoning content.

The “keep 20% high-entropy tokens” method in the 80/20 paper was just a simple proof of concept, a deliberately crude way to show that even a straightforward application of this principle can bring significant gains. Of course, this principle can inspire more elegant methods in future work.

The Flexibility Trap is one such example: it applies this principle to a broader setting in a simple and elegant way, and we are very happy to see it recognized as an Oral paper 😎

One small regret is that Beyond the 80/20 Rule also received fairly high NeurIPS scores at the time, but was not selected as a Spotlight. Still, one year later, it is now approaching 500 citations 😆

Please check out our work! I hope token-level entropy analysis can continue to be explored and developed across more areas.

The Flexibility Trap: https://t.co/hZxuJGbbb4
Beyond the 80/20 Rule: https://t.co/Tjxe7g4SYZ
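The token-level entropy view described in the post above can be illustrated with a small sketch. This is not the papers' implementation: it assumes per-position logits are available as a NumPy array, and the names `token_entropy`, `high_entropy_mask`, and `masked_mean` are hypothetical helpers introduced only for illustration. The 20% fraction comes from the post; the ranking rule used here (a simple top-k by entropy within one sequence) is an assumption.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits has shape (seq_len, vocab_size).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def high_entropy_mask(entropy: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
    """Boolean mask marking the top `keep_fraction` of positions by entropy."""
    k = max(1, int(round(keep_fraction * entropy.size)))
    top_idx = np.argsort(entropy)[-k:]  # indices of the k largest entropies
    mask = np.zeros(entropy.shape, dtype=bool)
    mask[top_idx] = True
    return mask

def masked_mean(per_token_loss: np.ndarray, mask: np.ndarray) -> float:
    """Average a per-token loss over the selected (high-entropy) positions only."""
    return float(per_token_loss[mask].mean())
```

In an RL setup, a mask like this would decide which token positions contribute to the policy loss; how the papers actually select and weight tokens is described in the linked works.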

1

44

6

28

9K

Rui Lu @RayLu_THU

3 months ago

Actually, my advisor asked me to do that. But the feature concatenation used in DenseNet is not suitable for Transformers: naive dense accumulation mixes up features from different levels. My advisor and I both got stuck and had no idea how to proceed. Really appreciate this work

0

4

1

0

50

Rui Lu @RayLu_THU

3 months ago

When ViT appeared back in 2021, many people tried to build a dense Transformer to reproduce the success of DenseNet. But most of them failed, including my advisor, the author of DenseNet. It turns out that articulating the Transformer's residual stream involves far more nuance. Great work

Kimi.ai @Kimi_Moonshot

3 months ago

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗Full report: https://t.co/u3EHICG05h
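To make the contrast between "fixed, uniform accumulation" and "input-dependent attention over preceding layers" concrete, here is a toy sketch. It is not the formulation from the linked report, which this page does not reproduce. It assumes a single token, a learned query vector per layer, and scaled dot-product scores; `uniform_residual`, `attn_residual`, and `query` are hypothetical names.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def uniform_residual(layer_outputs: np.ndarray) -> np.ndarray:
    """Standard residual stream: every earlier output is added with weight 1."""
    return layer_outputs.sum(axis=0)

def attn_residual(layer_outputs: np.ndarray, query: np.ndarray):
    """Weighted sum over preceding layer outputs.

    layer_outputs has shape (num_prev_layers, d); query has shape (d,).
    The weights depend on the representations themselves, so the mixture
    changes with the input instead of being fixed.
    """
    d = layer_outputs.shape[1]
    scores = layer_outputs @ query / np.sqrt(d)  # one score per earlier layer
    weights = softmax(scores)
    return weights @ layer_outputs, weights
```

Because the weights sum to one, the aggregated state stays on the scale of a single layer output rather than growing with depth, which is one plausible reading of the "hidden-state growth" point in the announcement.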

336

13K

2K

10K

5M

1

6

0

0

338

RayLu_THU retweeted

@liuzhuang1234

5 months ago

Do diffusion models produce novel data points, or reproduce the training data they see? What factors drive this?

In a new paper, we study this problem in the 3D generation context.

We propose an evaluation framework to quantify memorization in 3D shape generative models. We evaluate existing 3D generators and run controlled experiments to investigate what drives memorization and how to reduce it.

5

300

47

164

20K

Rui Lu @RayLu_THU

6 months ago

feel the reward of research https://t.co/ZH8a9ScxvH

0

9

0

0

232

Rui Lu @RayLu_THU

6 months ago

Arrived in America to attend #NeurIPS2025; I will be in San Diego from Dec 3rd to 8th. Excited to meet and discuss!

0

7

0

0

148

Rui Lu @RayLu_THU

6 months ago

Our paper, Does reinforcement learning really incentivize reasoning capacity beyond base model, has just won Best Paper Runner-Up at #NeurIPS2025!
Really honored that the idea and insights were recognized; we will stick with relevant and fundamental research. See you in San Diego! https://t.co/Z4ER410ngX

1

7

0

0

371

Rui Lu @RayLu_THU

7 months ago

The whole world's research productivity pauses for 1 hour because of ... https://t.co/6gZwhWM7la

0

1

0

0

102

Rui Lu @RayLu_THU

7 months ago

@ChaseBrowe32432 It is possible that at a larger scale or with better techniques, reinforcement learning can make the model perform better. But that is not why reasoning LLMs work for now.

0

3

0

0

53

Rui Lu @RayLu_THU

7 months ago

@ChaseBrowe32432 Thanks for your feedback. After all, reinforcement learning is all about improving sample efficiency. What we mainly want to argue is that the current paradigm, which uses GRPO or PPO to optimize a base model for hundreds of steps, mainly works by eliciting rather than improving

0

7

0

1

126

Rui Lu @RayLu_THU

7 months ago

@ChaseBrowe32432 Also, we manually checked the trajectories generated by the base model. They do not look like random guessing. The answers are confined to a very small set that includes the right answer, but the probability of generating this correct answer is quite low
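The point in the reply above (a correct answer that is reachable but rarely sampled) can be made concrete with a short calculation. The sketch assumes samples are drawn independently with a fixed per-sample success probability `p`. Both that assumption and the function name are illustrative and do not come from the paper.

```python
def prob_at_least_one_correct(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct.

    p is the per-sample probability of producing the correct answer.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k
```

Under this assumption, a small but nonzero `p` gives a low single-sample success rate while the success rate over many samples approaches 1, which is consistent with a base model that already contains the answer and training that mostly raises `p`.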

1

6

0

0

236

Rui Lu @RayLu_THU

7 months ago

@ChaseBrowe32432 As an author of this paper, I want to politely point out that there are many other benchmarks whose answers fall outside 0-1000 and are hard to guess. Not to mention coding benchmarks, where passing the tests means solving the problem. All show the same result

3

15

0

2

644

Rui Lu @RayLu_THU

10 months ago

A little milestone in academic life🫡 https://t.co/NGaq73HENd

0

12

0

0

740

Rui Lu @RayLu_THU

11 months ago

First time in my life, finally got the best paper
in workshop

direction>effort, indeed https://t.co/8YwudtjE2g

13

572

8

159

32K

RayLu_THU retweeted

Shenzhi Wang🌟

@ShenzhiWang_THU

12 months ago

🧐Two papers, opposite opinions.

Ours: High-entropy tokens drive all performance gains in LLM RL.

Another: Don’t let low-prob (often high-entropy) tokens over-dominate.

Both are valid. Why?
💡Model size matters. Larger LLMs support our view; smaller LLMs support theirs.

🧵⬇️ https://t.co/b0r8Fd089r

7

507

69

495

39K

Rui Lu @RayLu_THU

about 1 year ago

@iamarsibragimov indeed, it seems that although the overall reasoning trajectory is long, the important choices are made at very few critical moments

0

0

0

0

7

Rui Lu @RayLu_THU

about 1 year ago

How does a reasoning model actually reason? Our recent study shows that only the 20% of tokens with the highest entropy play a critical role in deciding the reasoning trajectory! Check us out

Shenzhi Wang🌟

@ShenzhiWang_THU

about 1 year ago

🚨Beyond 80/20 in LLM reasoning🚨Dropping 80% low-entropy tokens in RL greatly boosts performance
🔗https://t.co/LmdaCRjNvZ

🏆Zero-RL SoTA: 63.5/68.1 (AIME24), 56.7 (AIME25)
🚀Insights:
1. RL retains base model entropy patterns
2. High-entropy tokens drive all RL improvement
⬇️ https://t.co/YEsojmbkCZ

10

292

54

201

53K

2

18

2

12

2K

RayLu_THU retweeted

Jiahao Qiu @JiahaoQiu99

about 1 year ago

The GAIA game is over, and Alita is the final answer.

Alita takes the top spot in GAIA, outperforming OpenAI Deep Research and Manus.

Many general-purpose agents rely heavily on large-scale, manually predefined tools and workflows. However, we believe that for general AI assistants:

"Simplicity is the ultimate sophistication."

🔗Full paper: https://t.co/KoApMuFGFj

🔗More Details will be updated here: https://t.co/uGH3PHFbnG

#AI #Agent #LLM

17

97

32

43

26K

Rui Lu @RayLu_THU

about 1 year ago

@ArtemKRSV It seems that it hits a cliff and an enormous gradient just blows your model away. You may need gradient clipping
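For reference, the gradient clipping suggested in the reply above is commonly done by rescaling all gradients when their combined norm exceeds a threshold. The following is a minimal NumPy sketch of that idea. The function name, the `max_norm` parameter, and the small epsilon are illustrative choices. In PyTorch the built-in `torch.nn.utils.clip_grad_norm_` serves the same purpose.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm: float, eps: float = 1e-12):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Scale is 1.0 when the norm is already within bounds, so small
    # gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm
```

Logging the returned pre-clipping norm during training is a simple way to spot the kind of spike described in the reply.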

0

0

0

0

53

Rui Lu @RayLu_THU

about 1 year ago

The only thing that can stop the progress of AGI...
is Overleaf before the NeurIPS deadline🙃 https://t.co/jI6rAnxO0w

0

5

0

0

903
