Rishabh Agarwal @agarwl_ - Twitter Profile

Pinned Tweet

9 months ago

This is my last week at @AIatMeta. It was a tough decision not to continue with the new Superintelligence TBD lab, especially given the talent and compute density. But after 7.5 years across Google Brain, DeepMind, and Meta, I felt the pull to take on a different kind of risk.

The pitch from Mark and @alexandr_wang to build in the Superintelligence team was incredibly compelling. But I ultimately chose to follow Mark's own advice: “In a world that’s changing so fast, the biggest risk you can take is not taking any risk”.

In my short time at Meta, we did push the frontier on post-training for "thinking" models. Specifically:
- Pushing an 8B dense model to near DeepSeek-R1 performance with RL scaling.
- Using synthetic data mid-training to warm-start RL.
- Developing better on-policy distillation methods.

Really enjoyed working with @_arohan_, @brandfonbrener, Leo Li, @ErykHelenowski, @DatHuynh13, Xiaocheng, Jia, Boduo, and Yanjun.

154

3K

84

591

461K

agarwl_ retweeted

Xiuyu Li

@sheriyuo

about 9 hours ago

I have been reading a lot of recent work on self-evolving agents, and it really feels like the skill papers are all converging in this direction. For training-free memory or skill self-evolution, are there any frontier papers worth reading? Has anyone already shown a hard ceiling for this line of work, meaning RL is still the only way forward? A recent paper, "Learning, Fast and Slow: Towards LLMs That Adapt Continually," seems to take a different stance. It is not trying to settle the theory, it is basically saying, I want both. https://t.co/wqVetHPhvd

0

59

8

45

4K

agarwl_ retweeted

Alexey Gorbatovski @AMyashka

2 days ago

[1/7] OPD has a simple post-training loop: sample from the student, label with the teacher, repeat. The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there. TRB Paper: https://t.co/EPjiHZIE3s

5

109

23

103

10K
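
The loop described in the thread above (sample from the student, label with the teacher, repeat) can be sketched on a toy problem. This is only an illustration under stated assumptions, not the TRB paper's method: the student is a tabular policy whose state is the previous token, the teacher is a fixed table of next-token distributions, and the loss is the exact reverse KL on states the student visits. The names `opd_train`, `teacher_probs`, and all hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def opd_train(teacher_probs, steps=2000, seq_len=4, lr=0.5, seed=0):
    """Toy on-policy distillation (assumed setup, not a published recipe).

    teacher_probs[s] is the teacher's next-token distribution in state s,
    where the state is simply the previous token. All teacher
    probabilities must be strictly positive.
    """
    rng = random.Random(seed)
    vocab = len(teacher_probs)
    # Student starts uniform: one row of logits per state.
    logits = [[0.0] * vocab for _ in range(vocab)]
    for _ in range(steps):
        state = 0  # every rollout starts from token 0
        for _ in range(seq_len):
            p_s = softmax(logits[state])
            # 1. Sample the next token from the student (on-policy).
            token = rng.choices(range(vocab), weights=p_s)[0]
            # 2. The teacher labels the state the student actually visited.
            p_t = teacher_probs[state]
            # 3. Gradient step on KL(student || teacher) w.r.t. the logits:
            #    d KL / d z_k = p_k * (log(p_k / q_k) - KL).
            log_ratio = [math.log(ps / pt) for ps, pt in zip(p_s, p_t)]
            kl = sum(ps * lr_k for ps, lr_k in zip(p_s, log_ratio))
            for k in range(vocab):
                logits[state][k] -= lr * p_s[k] * (log_ratio[k] - kl)
            state = token
    return logits


if __name__ == "__main__":
    teacher = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
    student_logits = opd_train(teacher)
    for s, row in enumerate(student_logits):
        print(s, [round(p, 3) for p in softmax(row)])
```

Only states reached by the student's own rollouts are ever updated, which is the property the thread points at: early rollouts come from the weakest version of the student, so that is where training begins.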

Rishabh Agarwal

@agarwl_

3 days ago

@pradheepraop One realization I had a couple of years ago was that pure on-policy distillation works reliably, hence the name of the paper.

1

15

0

13

2K

Rishabh Agarwal

@agarwl_

3 days ago

@leavittron Every dog has its day

0

16

0

998

Rishabh Agarwal

@agarwl_

5 days ago

@kevinroose

0

32

0

3K

Rishabh Agarwal

@agarwl_

5 days ago

Someone once told me: "You should be the last one to reinvent something" -- not sure how useful this is, but this is a common occurrence in science. It is true that frontier AI labs have innovations that are often simultaneous / re-discovered by academic labs. However, folks outside those labs have no way of knowing about those innovations and their only source of reference would be the work shared openly.

7

193

11

47

31K

Rishabh Agarwal

@agarwl_

5 days ago

@chopwatercarry Yeah I think so

0

1

0

1

77

Rishabh Agarwal

@agarwl_

6 days ago

Speculative OPD addresses this exact issue in OPD: the student distribution can sometimes be too far from the teacher's for the teacher to provide useful feedback. https://t.co/uca9fw9mxp

Omar Khattab

@lateinteraction

7 days ago

extremely informal rant: on-policy distillation is so awkward and frankly just super overrated.

why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting. imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus. after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.

or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1-step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!

in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.

but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation is conceptually a much more powerful direction. the teacher's job is to actually instruct and to diverge from student's likely actions to the (smallest) necessary extent for success.

40

488

34

378

69K

4

157

8

162

20K

agarwl_ retweeted

Anjney Midha

@AnjneyMidha

6 days ago

today @CS153Systems, the students got to hear from @LiamFedus and @ekindogus about their search for a room temperature superconductor at @periodiclabs

the kids will remember this one for the rest of their lives

https://t.co/ccFHNtj4b7

5

224

10

60

17K

agarwl_ retweeted

Mark Saroufim

@marksaroufim

6 days ago

My MLSys keynote on AI writing systems code got more interest than I expected. The recording will take a while, so in the finest tradition of AI labs sharing blog posts, we’re starting the Core Automation Blog with this one https://t.co/h4uSOyrglf

23

637

68

589

171K

Rishabh Agarwal

@agarwl_

6 days ago

https://t.co/ro4DpMAzyd (from @WendaXu2)

0

19

1

19

2K

agarwl_ retweeted

Niels Rogge @NielsRogge

9 days ago

One of the hottest terms in AI right now is "On-policy distillation". It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL. Now a method on PapersWithCode! Find all 183 papers that cite it, and more here: https://t.co/NIsUjyU3UP

21

1K

128

1K

84K

Rishabh Agarwal

@agarwl_

9 days ago

@NielsRogge Nice! It's missing more than half of the follow-up papers, based on my Google Scholar: https://t.co/pJ1m5i7tIx

1

8

0

12

3K

Rishabh Agarwal

@agarwl_

11 days ago

The quest for reliable on-policy self-distillation continues. Hopefully something will stand the test of time.

Applied Compute @appliedcompute

12 days ago

Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.

9

295

27

285

85K

2

132

7

92

17K
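
The quoted thread contrasts a per-token learning signal with a single scalar reward and describes masking out tokens that carry little signal. A minimal sketch of a masked per-token loss is below. The mask is taken as a given input here: the thread does not describe RMSD's two-step filter, so it is not reproduced, and `masked_mean_loss` and its arguments are hypothetical names.

```python
def masked_mean_loss(token_losses, mask):
    """Average per-token losses over the positions kept by the mask.

    token_losses: one loss value per generated token.
    mask: same length; truthy entries mark tokens that contribute.
    How the mask is chosen is outside the scope of this sketch.
    """
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    if not kept:
        # No token selected: contribute no loss rather than divide by zero.
        return 0.0
    return sum(kept) / len(kept)
```

With an all-ones mask this reduces to the plain per-token average, so the mask is the only thing that distinguishes the two objectives in this sketch.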

Rishabh Agarwal

@agarwl_

12 days ago

Very well-written blog. I think of RL as learning from interventions, and it kinda explains why it's more powerful as a paradigm than supervised learning. Now, learning from counterfactuals is something we haven't historically been good at, but maybe world modelling + RL can get us there.

Vishal Misra @vishalmisra

13 days ago

@ShriramKMurthi @Hesamation Have you read this https://t.co/tabFRNNsbD

11

188

19

376

117K

6

355

21

502

68K

agarwl_ retweeted

(((ل()(ل() 'yoav))))👾

@yoavgo

13 days ago

this is a clear demonstration of:

9

139

6

25

20K

Rishabh Agarwal

@agarwl_

13 days ago

@eliebakouch @_lewtun SFT / RL data separate good

1

20

0

7

1K

Rishabh Agarwal

@agarwl_

14 days ago

@marcgbellemare @sivareddyg @reliant_ai @cohere Congrats!

0

2

0

433

Rishabh Agarwal

@agarwl_

16 days ago

Perplexed by this take: Sure, let's not mainly do supervised learning on human knowledge, but it makes sense to build off it instead of the *let's do it from scratch* approach. People cite AlphaGo vs AlphaGo Zero as a quintessential example of how using human-generated data is suboptimal, but it was *imitating* it that was suboptimal. What if we learned from that data assuming it was suboptimal in the first place (so not supervised learning, but an RL-like mindset of using that data)?

Richard Sutton

@RichardSSutton

16 days ago

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

136

7K

973

3K

570K

6

136

1

70

27K

Rishabh Agarwal

@agarwl_

18 days ago

For the love of the game

rohan anil

@_arohan_

18 days ago

We did research when pay was low. We did research when pay was uncertain. We did research even when we were lucky enough to be paid well. One way to figure out what to work on is to work on things that matter and not think of rewards. We are still quite early into what makes a frontier model all the way from optimization, architecture and objectives. Big token wants to convince you otherwise.

35

1K

60

278

200K

1

123

2

8

12K
