Yunzhen Feng @feeelix_feng - Twitter Profile

Yulin Chen ✈️ ICML2026 @YulinChen99

about 1 month ago

Most assume unlearnable examples never get positive reward. They do. In our ICML paper, we reveal that a hard problem can receive positive reward during RLVR but remain unlearned. We show the phenomenon is more likely a representation issue than an RL optimization artifact.

Yunzhen Feng @feeelix_feng

about 1 month ago

A new sampling-based defense against model distillation: unbiased for the user, but it hurts the attacker.

Zibo Diao @ZiboDiao

about 1 month ago

🚀 Excited to share our new paper: “Lossless Anti-Distillation Sampling” (LADS)! We propose a sampling-based defense against multi-account distillation that weakens distillation while preserving a lossless experience for benign users. 🛡️ Paper: https://t.co/lAvcDxmvac

feeelix_feng retweeted

Tianle Cai

@tianle_cai

3 months ago

https://t.co/CivOb4riiJ

feeelix_feng retweeted

Yuda Song @yus167

5 months ago

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

feeelix_feng retweeted

Shobhita Sundaram

@shobsund

5 months ago

Can a model learn to break its own reasoning plateau? In our new paper, we show that LLMs can be taught with meta-RL to generate their own "stepping stones" that kickstart learning on hard math problems (0/128 success rate) where direct RL fails. Paper 📝: https://t.co/lUlrJt6bwq Blog post 🌐: https://t.co/v1y24h1fP4 (1/n)

feeelix_feng retweeted

Luhuan Wu @hlws_bot

7 months ago

🚀 ML / Applied Math / Stats PhD Opportunities @JohnsHopkins I'm recruiting PhD students excited about generative modeling, probabilistic inference, and scientific applications (biochemistry, physics, and more), with strong backgrounds in CS/Math/Stats/Basic Science and curiosity for advancing ML and solving real-world problems! Apply to our Applied Mathematics and Statistics PhD program by Dec 15, 2025, and become part of the broader @HopkinsDSAI community! https://t.co/2YJwqS4FzK

Yunzhen Feng @feeelix_feng

7 months ago

@codewithimanshu @KempeLab Best reasoning: Be accurate first and then improve the efficiency

Yunzhen Feng @feeelix_feng

7 months ago

I’ll be at #NeurIPS2025 until 12/7! 👋 Please reach out if you want to chat about RL, reasoning, self-evolving, or LLM diversity. My presentations:
🌟 Fri, Dec 5 (11a-2p): Spotlight on Synthetic Data Scheduling, #4108
🌟 Sat, Dec 6 (11:30a & 4:30p): Spotlight on evaluating CoT, Hall F

feeelix_feng retweeted

Julia Kempe

@KempeLab

7 months ago

I will be recruiting 1-2 PhD students at @NYUDataScience or @NYUCourant CS to work on Machine Learning & applications in NYU's vibrant top ML ecosystem. Check Google Scholar to see our latest research interests. Interested? Please mention my name in your application. Deadline: 12/12.

Yunzhen Feng @feeelix_feng

7 months ago

@zorikgekhman Hey Zorik, thanks for the interest in our work. Could you share your email address?

Yunzhen Feng @feeelix_feng

9 months ago

🔥 NEW PAPER: What makes reasoning traces effective in LLMs? Spoiler: It's NOT length or self-checking. We found a simple graph metric that predicts accuracy better than anything else—and proved it causally. 🧵[1/n]

Yunzhen Feng @feeelix_feng

8 months ago

@jxmnop Instruction following ability in the generation?

feeelix_feng retweeted

Nikos Tsilivis @nikostsilivis

8 months ago

RL has led to amazing advances in reasoning domains with LLMs. But why has it been so successful, and why does the length of the response increase during RL? In new work, we introduce a framework to provide conceptual and theoretical answers to these questions.

feeelix_feng retweeted

Saining Xie

@sainingxie

8 months ago

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

Yunzhen Feng @feeelix_feng

8 months ago

@AntChen_ @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience 1) Yes, but p* does not sum to 1 over all o. It represents the probability of correctness of o given q. 2) Yes. For the same reason, we need to scale the policy probability into correctness probability.

Yunzhen Feng @feeelix_feng

8 months ago

Current GRPO wastes compute on negative groups — when all K samples are wrong, you get zero gradient despite full generation cost. We propose a principled fix by bridging reward modeling and policy optimization: 👉 Penalize highly confident wrong answers more to create signal.🧵
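
The zero-gradient claim can be checked with a minimal sketch. It assumes a common group-normalized advantage of the form (r - mean) / (std + eps) with binary rewards; the function name `group_advantages` and the `eps` value are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the K samples of one prompt
    return [(r - mu) / (sigma + eps) for r in rewards]

# One correct sample out of K=4: advantages are nonzero, so there is signal.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

# A negative group: all K samples are wrong, so every advantage is exactly 0.
negative = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(negative)
```

Because the policy-gradient term is weighted by the advantage, an all-zero advantage vector contributes nothing to the update even though all K rollouts were generated.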

Yunzhen Feng @feeelix_feng

8 months ago

@josancamon19 @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn Are you referring to the dynamic sampling in DAPO? In DAPO, they oversample and then filter out all the negative groups. In contrast, we aim to recover training signal from those discarded groups.
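
The filtering step described above can be sketched roughly as follows. The dictionary layout and the name `keep_informative_groups` are hypothetical; this only illustrates dropping groups whose rewards are all identical and is not DAPO's actual implementation.

```python
def keep_informative_groups(groups):
    """Keep only groups whose rewards are not all identical (hypothetical helper)."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

# Hypothetical oversampled batch: each entry is one prompt with K=4 binary rewards.
batch = [
    {"prompt": "q1", "rewards": [0, 0, 0, 0]},  # negative group: filtered out
    {"prompt": "q2", "rewards": [1, 0, 0, 1]},  # mixed group: kept
    {"prompt": "q3", "rewards": [1, 1, 1, 1]},  # all correct: also filtered out
]

kept = keep_informative_groups(batch)
discarded = [g for g in batch if g not in kept]
print([g["prompt"] for g in kept])
```

In this sketch the rollouts in `discarded` are generated and then thrown away, which is the compute the reply says the proposed method tries to reuse.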

Yunzhen Feng @feeelix_feng

8 months ago

@siddarthv66 Prompt changes do affect eval. But both the baseline training and our method use the same eval setup, for a fair comparison. The experiments are run with university compute, so I wish I could run more. We are experimenting with LoRA for training.

Yunzhen Feng @feeelix_feng

8 months ago

@siddarthv66 We put Numina 1.5 in the appendix because of the page limit. The eval challenge is mostly for Llama: its accuracy is <1% on AIME25. We do not want GSM or AMC because they're saturated. What other math benchmarks are there that are not contaminated?

Yunzhen Feng @feeelix_feng

8 months ago

@KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience Key observations:
(1) Our method continues to improve accuracy when GRPO saturates ⬆️
(2) Our method improves all Pass@k metrics
This matches our intuition: by learning from negative groups, we get better exploration on hard problems, where it matters most.
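
For readers unfamiliar with Pass@k, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper computes Pass@k this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which gives an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 correct out of 10 samples: pass@1 equals the plain accuracy c / n.
p1 = pass_at_k(10, 3, 1)
# With no correct samples, pass@k is 0 for every k.
p_none = pass_at_k(10, 0, 5)
print(p1, p_none)
```

Larger k rewards a model for solving a problem at least occasionally, which is why Pass@k is often read as a proxy for exploration on hard problems.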

Yunzhen Feng @feeelix_feng

8 months ago

@KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience We experiment on two different training sets with Llama-3.1-8B and Qwen-2.5-3B 📈 For MATH+DAPO, we run two random seeds. Our method consistently outperforms GRPO across training, with significant improvements on hard problems (Level 4-5).
