Arun Iyer @aiongradflow - Twitter Profile

1 day ago

1/ 🧵 Meet Tapered Language Models (TLMs):

Modern language models (transformer, recurrent, memory-based) are a stack of *identical* layers. A uniform parameter distribution across depth, inherited from the 2017 transformer and rarely questioned.

Turns out it's leaving free performance on the table.

Perplexity drops 16.28 → 14.44 (same params, same compute) 👇

[reza_byt's tweet photo]
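To put the perplexity figures quoted above in perspective, here is a small worked calculation. It assumes the usual definition of perplexity as the exponential of the mean per-token cross-entropy in nats; the thread does not state which convention was used, so treat the loss figures as illustrative. Under that assumption the drop is roughly an 11.3% relative reduction in perplexity, or about 0.12 nats per token.

```python
import math

# Perplexity values quoted in the tweet.
ppl_before = 16.28
ppl_after = 14.44

# Relative reduction in perplexity.
relative_drop = (ppl_before - ppl_after) / ppl_before

# Assumption (not stated in the thread): perplexity = exp(mean cross-entropy in nats).
loss_before = math.log(ppl_before)
loss_after = math.log(ppl_after)
loss_drop = loss_before - loss_after

print(f"relative perplexity drop: {relative_drop:.1%}")
print(f"cross-entropy drop: {loss_drop:.3f} nats/token")
```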

8

144

34

135

36K

AIonGradFlow retweeted

Qwen

@Alibaba_Qwen

2 days ago

📣📣 Meet Qwen-AgentWorld, a native language world model that simulates 7 agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within a single model. Environment modeling is the training objective from day one, not a post-hoc adaptation.

🤔 LLMs are trained to be better agents, better at acting in environments. But nobody has trained them to model the environments themselves.

🗺️ Our roadmap: investigate how language world modeling can push the boundaries of general agent capabilities, along two routes:

1️⃣ Build a foundation model for environment simulation, outperforming Claude Opus 4.8 and GPT-5.4 on AgentWorldBench

2️⃣ Investigate how world modeling enhances agent training:
🔬 Controllable Sim RL (agentic RL with LWM as environments) surpasses training in real environments
🧠 Learning to predict environments (LWM warm-up) makes agents stronger; remarkably, even without any agent-specific training, this predictive knowledge transfers to agentic tasks with zero fine-tuning

📑 Paper: https://t.co/Jx2l5RKq71
📖 Blog: https://t.co/7tVcKyhsx2
GitHub: https://t.co/B5Lvb1UZCn
🤗 HuggingFace: https://t.co/Kw3QBL1TM5
🧩 ModelScope: https://t.co/YBnGYgMWWI

191

5K

774

4K

1M

AIonGradFlow retweeted

Ankur Samanta @Ankur_Samanta_

4 days ago

🚀 New work on credit assignment in multi-step reasoning RL post-training 🚀
Introducing Self-Reset Policy Optimization (SRPO): i) localize the first wrong reasoning step, ii) reset to that step, iii) learn from counterfactual continuations from there – no external supervision. 🧵 https://t.co/A1KHt2CRCF

13

200

32

204

53K

AIonGradFlow retweeted

Ben Recht @beenwrekt

3 days ago

My teacher and friend Dimitri Bertsekas passed away earlier this month. I wrote about his broad contributions to the field of numerical optimization and his deep impact on my writing and research. https://t.co/CIDI9C31Tj

0

152

34

47

14K

AIonGradFlow retweeted

Simon Weber @SimWeberTUM

10 days ago

What if attention wasn't about matching tokens, but operating in function space?

Glad to share our #ICML2026 paper:
📄 Functional Attention: From Pairwise Affinities to Functional Correspondences

w/ @Jiefang_Xiao @GaoMaolin @stevenygd Daniel Cremers
📄 https://t.co/rhn9NtwrBm https://t.co/8V3dbshHvt

17

1K

160

1K

119K

AIonGradFlow retweeted

xuanming zhang @xuanmingzhangai

3 days ago

1/8 🧠 Think the deepest layer of an LLM is always the best for output? Think again! Our latest paper from the Qwen Team reveals the "Alignment Tax" hiding in your final layers. Post-training can violently perturb terminal tokens away from rigorous logic! 🧵 ↓ https://t.co/4zMN1Vr8L0

12

118

21

84

9K

AIonGradFlow retweeted

Jayden Teoh

@jayden_teoh_

4 days ago

I really like this paper on why predicting latents is more data-efficient than predicting tokens. In Next-Latent Prediction Transformer (https://t.co/d0Jzdduhtk), we provide a different argument on why predicting your own next latent can improve data-efficiency. Predicting the next latent h_{t+1} gives you a rich learning signal about ALL future latents, and future tokens. This is because h_{t+1} is trained to predict h_{t+2}, h_{t+2} is trained to predict h_{t+3}, and so on... As such, predicting the next latent gives denser gradient signals than one-hot token predictions :) I talk more about this in my blog: https://t.co/jSsDYndDn8

2

508

53

505

48K

AIonGradFlow retweeted

WeiboLLM @WeiboLLM

10 days ago

⭐ VibeThinker-3B is released: a dense 3B model for frontier-level verifiable reasoning.

🚀 Reasoning: 94.3 on AIME’26, 76.4 on IMO-AnsBench, and 80.2 Pass@1 on LCB v6; with CLR, AIME’26 improves to 97.1 and IMO-AnsBench to 80.6.

💻 OOD Coding: On recent unseen LeetCode weekly contests, VibeThinker-3B passes 123/128 (96.1%) first-attempt Python submissions.

⚡ Efficiency: Only 3B parameters, yet reaching the performance range of much larger top-tier reasoning models.

🧠 Perspective: Small models are not just cheaper substitutes. In parameter-dense domains with clear verification signals, SLMs offer a path to frontier-level reasoning that complements traditional Scaling Law.

Model: https://t.co/94A14zpqCV
GitHub: https://t.co/32so5P6C7L
Paper: https://t.co/UDd264RsZb

#AI #LLM #Reasoning #OpenSource #SmallModel

49

1K

150

881

95K

AIonGradFlow retweeted

Alex Smola

@smolix

12 days ago

Next week I am teaching a tutorial on efficient LLM inference at the Machine Learning Summer School 2026 in NYC, hosted this year at Columbia University. The slides are below. There are about 150 of them, which sounds small, given how far the field has come. https://t.co/c3F7rtkdy6

7

210

23

176

11K

AIonGradFlow retweeted

Pratyush Kumar

@pratykumar

13 days ago

Ok, so here is my take on the Fable ban, sovereign AI, Sarvam, etc. The event is interesting as it has implications from many perspectives.

For AI users, it is clear that you should not confuse access with ownership, or adoption itself with advantage. And if the most significant tech differentiator you are leveraging has external control loops, then you have to accept you are vulnerable.

For AI talent, it is now a precedent that you would be *seen* aligning to national interests more than company interests. And even if it's just a whim for now, this trend will be hard to reverse as the world gets more automated…

For AI labs, their offerings will be stratified - general purpose AI would be available as utility, but frontier AI would be gated. This is a fantastic business model for labs - *democratized* AI sucks in all the data liquidity of the world which is locked in higher margin frontier offerings.

I think for the world to be a better place, all three of the above are bad vectors. We need to have more countries and companies owning their own destinies. And in the post AI world, that means being able to use and improve AI systems within their own perimeters - what one may call Sovereign AI.

At Sarvam, Sovereign AI in India was the founding thesis a couple of years back, and continues to remain the core operating principle. From our vantage point, it is super clear that India will build, leverage, and create massive business value and societal impact with sovereign AI. The following is precisely how we at Sarvam are contributing to make that happen.

84

2K

303

422

175K

AIonGradFlow retweeted

Sanjeev Arora

@prfsanjeevarora

14 days ago

Really like this simple and elegant idea for self-improvement where a reward model is available. No RL needed, just on-policy distillation loss where teacher = same LLM as student, but teacher is given student's answer and its scalar reward.
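A minimal sketch of what such a self-distillation loss could look like, under stated assumptions. The tweet does not give the loss, so the choice of a per-token reverse KL (student against teacher) is an assumption borrowed from common on-policy distillation setups, and the names `token_kl` and `self_distillation_loss` are hypothetical. The toy logits stand in for two forward passes of the same model over the student's sampled answer: one conditioned on the prompt only (student), one conditioned on the prompt plus the student's answer and its scalar reward (teacher).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position (assumed direction)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def self_distillation_loss(student_seq, teacher_seq):
    """Mean per-token KL over the positions of the student's sampled answer."""
    kls = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(kls) / len(kls)

# Toy logits: 2 answer tokens, vocabulary of size 3 (illustrative values only).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]   # model(prompt)
teacher = [[1.0, 1.5, -1.0], [0.1, 0.2, 2.0]]   # model(prompt + answer + reward)

loss = self_distillation_loss(student, teacher)
zero = self_distillation_loss(student, student)  # identical distributions
print(loss, zero)
```

In a real training loop the teacher pass would be treated as a fixed target (no gradient), which this pure-Python toy does not model.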

1

55

5

62

12K

AIonGradFlow retweeted

Yujia Zheng

@YujiaZheng9

17 days ago

Given a generalist model, how do we turn it into a specialist for the task we care about?

We give a *provable* answer in our ICML 26 paper with @ffeng01, Yuke Li, Shaoan Xie, @sirbayes, and @kunkzhang

https://t.co/SEYJ8i4cfZ https://t.co/Gn2MTmiK3M

4

143

19

161

36K

AIonGradFlow retweeted

Li Jiang

@louieworth

17 days ago

OPD is on-policy, but its supervision is still post-hoc and one-step.

The student generates a rollout. The teacher then supervises that fixed trajectory token by token.

Our new paper argues that this can fail at the wrong scale.

When the prefix itself is broken, the problem is not only which tokens to reweight, clip, or truncate. The problem is the trajectory.

We propose Trajectory-Refined Distillation (TRD): refine the student rollout under teacher guidance before distillation.

with @ryanxhr Yichuan Ding @yayitsamyzhang

Paper: https://t.co/DKkKL5zj69

4

93

12

80

5K

AIonGradFlow retweeted

Li Jiang

@louieworth

17 days ago

New blog post: On-Policy Distillation: Promise, Pitfalls, and Prospects.

OPD combines on-policy rollouts with dense teacher supervision.

But it is not a free lunch.

I discuss three failure modes and introduce our new paper.

https://t.co/xU35CqoMi3 https://t.co/24ElhIhgje

9

310

44

381

20K

AIonGradFlow retweeted

Aditi Raghunathan

@AdtRaghunathan

18 days ago

We've always intuited that verification is easier than generation. Chen's new work shows that explicitly training for it unlocks massive self-improvement: 📈 14× boost in test-time refinement on hard reasoning 🚀 30% gain beyond the RL plateau at training time

2

151

14

123

18K

AIonGradFlow retweeted

Mark Schmidt @MarkSchmidtUBC

20 days ago

I am sad to hear of the passing of Dimitri Bertsekas. This one hurts.

Dimitri had a big effect on my career, from inspiring research topics to writing one of my tenure letters.

A long thread on memories of Bertsekas and some of his works that influenced me the most. https://t.co/2ytJkkMbOy

4

264

32

65

14K

AIonGradFlow retweeted

Chelsea Finn

@chelseabfinn

19 days ago

Scaling RL to long horizons remains a major challenge.

Long-horizon Q-learning (LQL) prevents compounding bootstrapping errors by bounding the difference in value over long horizons.

It shows large gains over 1-step TD and n-step returns!

Paper: https://t.co/OTk3M6cz8p https://t.co/kwOGH4algI
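For readers less familiar with the two baselines named above, here is a small sketch of the standard 1-step TD target and the n-step return. It does not implement LQL, whose details are not given in the tweet; the function names and the toy trajectory are illustrative assumptions. `values` holds one more entry than `rewards`, with the last entry being the value of the final state.

```python
def one_step_td_target(rewards, values, t, gamma):
    """1-step TD target: r_t + gamma * V(s_{t+1})."""
    return rewards[t] + gamma * values[t + 1]

def n_step_return(rewards, values, t, n, gamma):
    """n-step return: n discounted rewards, then bootstrap from V(s_{t+n}).

    The horizon is truncated if the trajectory ends before n steps.
    """
    horizon = min(n, len(rewards) - t)
    discounted = sum(gamma**k * rewards[t + k] for k in range(horizon))
    return discounted + gamma**horizon * values[t + horizon]

# Toy trajectory (illustrative numbers only).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0) .. V(s_3); s_3 is terminal
gamma = 0.5

td1 = one_step_td_target(rewards, values, t=0, gamma=gamma)
ret2 = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
print(td1, ret2)
```

The 1-step target bootstraps from a value estimate at every step, while a larger n relies more on observed rewards and bootstraps less often; that is the trade-off behind the compounding bootstrapping errors the tweet mentions.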

7

495

50

387

59K

AIonGradFlow retweeted

Yifan Zhang

@yifanzhang_

22 days ago

Introducing Self-Distilled Policy Gradient.

Token-level rewards, credit assignment, self-distillation.

RL and distillation are converging toward the same idea:

Policy gradients, it always has been, it always will be.

https://t.co/RJeRFUTeyz https://t.co/frNpVjyPW3

5

747

93

656

84K

AIonGradFlow retweeted

Hanna Hajishirzi

@HannaHajishirzi

24 days ago

MAI-Thinking-1 is out!

Excited to share what we are building and how climbing from scratch (no distillation) actually works: simple recipes, rigorous science, self-distillation, patience, and great infra.

Check out our tech report, which has the full story of our RL climbs.
https://t.co/aLW40sWz4d

24

873

127

383

131K

AIonGradFlow retweeted

Guowei Xu

@Kevin_GuoweiXu

29 days ago

🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer?

Best-of-N (e.g., GRPO) and tree search share two limitations:
🔻 Verification signals are sparse
🔻 Candidates stay within the model's own distribution

We introduce BES: Bidirectional Evolutionary Search, a search framework that couples forward candidate evolution with backward goal decomposition.

✅ Works for both post-training and inference.

15

690

113

759

243K
