Sarthak Mittal @sarthmit - Twitter Profile

Pinned Tweet

about 2 months ago

Is distribution sharpening actually the future of scaling, or just a massive hype train? 📉 We put it to the test using an RL framework – simulating everything from sharpening to task reward optimization. Result: It’s not the silver bullet everyone thinks it is!

Sarthak Mittal

@sarthmit

3 days ago

I always feel more people should know this

Taco Cohen

@TacoCohen

3 days ago

@yoavgo As it turns out, the KL regularized return maximization objective is exactly the ELBO from variational inference. One is forced to REINFORCE because you can’t use the reparameterization trick, but other than that it’s a VAE where action / reasoning tokens are the latents.
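
The equivalence described in the quoted tweet can be written out in a few lines. This is a sketch under assumed notation that does not appear in the tweet: $x$ is a prompt, $z$ the sampled action/reasoning tokens, $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the reference policy playing the role of the prior, $r(x, z)$ the return, $\beta > 0$ the KL coefficient, and $O$ a hypothetical "optimality" observation whose (unnormalized) likelihood is $\exp(r(x, z)/\beta)$.

```latex
\begin{aligned}
J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ r(x, z) \big]
     - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
  &= \beta \Big( \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ \log p(O \mid x, z) \big]
     - \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big),
  \qquad \log p(O \mid x, z) := r(x, z) / \beta \\
  &\le \beta \, \log \sum_{z} \pi_{\mathrm{ref}}(z \mid x) \, \exp\big( r(x, z) / \beta \big)
   = \beta \, \log p(O \mid x).
\end{aligned}
```

The bracketed term in the second line has the form of an ELBO with prior $\pi_{\mathrm{ref}}$ and variational posterior $\pi_\theta$. The bound is tight when $\pi_\theta(z \mid x) \propto \pi_{\mathrm{ref}}(z \mid x) \exp(r(x, z)/\beta)$. Because $z$ is a sequence of discrete tokens, the expectation cannot be differentiated through a reparameterized sample, which is why a score-function (REINFORCE) estimator is used.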

sarthmit retweeted

Oleksii Kuchaiev

@kuchaev

3 days ago

Our post-training pipeline is a substantial redesign from Super. The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across a huge mix of agentic/reasoning/code/safety environments, then Multi-teacher On-Policy Distillation (MOPD). 10+ domain-specialized teachers, merged into the student via dense token-level guidance on its own rollouts. See Figures below for overview and tech report for all the details. 2/4

sarthmit retweeted

Dane Malenfant

@dvnxmvl_hdf5

10 days ago

🚨Excited to announce our workshop Context Beyond the Window hosted at COLM in SF! 🚨

LLMs have finite context windows, yet real-world tasks demand absorbing, retaining, and acting on information that far exceeds any single prompt.

1/3

We're looking for submissions across:

https://t.co/6y1ILeeC9A

• Context compression 🧃 — token compaction, recursive subagent calls, and external memory for storing and retrieving information
• Efficient architectures 🚀 — sub-quadratic attention variants that make extremely long context computationally feasible
• Continual training 🌱 — test-time training on streaming data, context distillation, and knowledge accumulation through continued pre-training
• Agentic memory systems 🐘 — scaffolds and test-time scaling techniques that improve knowledge retention and acquisition in LLMs
• Evaluation 🎯 — benchmarking models on increasingly long-horizon tasks

13 days ago

RT @AnjaSurina: AlphaProof Nexus advancing research math, solving 9 Erdős problems & more! Amazing experience to be part of this team & pro…

Sarthak Mittal

@sarthmit

16 days ago

@RyanBoldi Is it fair to call this optimizing vector-valued rewards, since in reality the reward is being reduced to a single number which is the weighted mean with expected weights?
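
The point raised in this reply can be checked numerically. The sketch below is a minimal illustration under assumptions that are not in the thread: the weights are sampled independently of the reward, the reward vector is fixed, and all names (`scalarize`, `mean_weights`, `reward_vec`) are hypothetical. Under those assumptions, averaging the randomly weighted scalar reward gives the same number as weighting once with the mean weights, by linearity of expectation.

```python
import random


def scalarize(reward_vec, weights):
    """Reduce a vector-valued reward to one number via a weighted sum."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def mean_weights(weight_samples):
    """Component-wise average of a list of weight vectors."""
    n = len(weight_samples)
    dim = len(weight_samples[0])
    return [sum(w[i] for w in weight_samples) / n for i in range(dim)]


random.seed(0)

# Hypothetical 3-objective reward for a single outcome.
reward_vec = [1.0, -2.0, 0.5]

# Sample random weight vectors on the simplex, independent of the reward.
weight_samples = []
for _ in range(1000):
    raw = [random.random() for _ in reward_vec]
    total = sum(raw)
    weight_samples.append([x / total for x in raw])

# Average of the scalarized rewards over the sampled weights ...
avg_of_scalarized = sum(
    scalarize(reward_vec, w) for w in weight_samples
) / len(weight_samples)

# ... versus a single scalarization with the averaged weights.
scalarized_with_avg = scalarize(reward_vec, mean_weights(weight_samples))

print(abs(avg_of_scalarized - scalarized_with_avg) < 1e-9)
```

If the weights depended on the sampled outcome, the two quantities would no longer be guaranteed to match, so this only illustrates the independent-weights case.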

sarthmit retweeted

Moksh Jain

@JainMoksh

26 days ago

The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.

sarthmit retweeted

John Schulman

@johnschulman2

27 days ago

Sharing our work on full-duplex multimodal models -- real-time interaction that's natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to intelligence/autonomy because they're harder to eval. In the future, we think every AI system will have something like an interaction model as the outer user-facing layer, continually keeping the user informed and learning what they actually want.

Sarthak Mittal

@sarthmit

28 days ago

@Harman26Singh RL from scratch would be really sample-inefficient, so I don’t think it will replace pretraining.

Sarthak Mittal

@sarthmit

about 1 month ago

Many congratulations to @FrankRHutter, @noahholl, sauraj, and the entire Prior team!!

Prior Labs

@prior_labs

about 1 month ago

Today we announced a major milestone: @prior_labs has entered into a definitive agreement to be acquired by @SAP, scaling Prior Labs to become the next frontier AI lab for structured data. 🧵

Sarthak Mittal

@sarthmit

about 1 month ago

@offsetx0 I get the idea; it's more of a personal quirk that I find this nomenclature weird.

Sarthak Mittal

@sarthmit

about 1 month ago

Kind of weird that the Gemma 4 2B model is actually 5B (including embeddings; I guess that is why they call it E2B). And here I was thinking I had found something comparable to Qwen3 1.7B.

Sarthak Mittal

@sarthmit

about 1 month ago

@Amank1412 And who really cares about arc-agi-3?

Sarthak Mittal

@sarthmit

about 1 month ago

@eliebakouch @himanshustwts I don’t think that is even feasible as a strategy, especially if you keep scaling model size. You def need pretraining.

Sarthak Mittal

@sarthmit

about 1 month ago

The real moat feels like data and compute; what do others think?

Sarthak Mittal

@sarthmit

about 2 months ago

We used the Nemo RL codebase to implement the RL training. Paper: https://t.co/HqDQNbVD3i Joint work with Leo and Guillaume. The setup is heavily inspired by https://t.co/8wAeDLvju1

Sarthak Mittal

@sarthmit

about 2 months ago

Could this all be down to RL training instabilities rather than distribution sharpening? Our training health checks show consistently improving reward, which indicates that the training methodology works fine and the optimum itself is to blame.
