This is my last week at @AIatMeta. It was a tough decision not to continue with the new Superintelligence TBD lab, especially given the talent and compute density. But after 7.5 years across Google Brain, DeepMind, and Meta, I felt the pull to take on a different kind of risk.
The pitch from Mark and @alexandr_wang to build in the Superintelligence team was incredibly compelling. But I ultimately chose to follow Mark's own advice: “In a world that’s changing so fast, the biggest risk you can take is not taking any risk”.
In my short time at Meta, we did push the frontier on post-training for "thinking" models. Specifically:
- Pushing an 8B dense model to near DeepSeek-R1 performance with RL scaling.
- Using synthetic data mid-training to warm-start RL.
- Developing better on-policy distillation methods.
Really enjoyed working with @_arohan_, @brandfonbrener, Leo Li, @ErykHelenowski, @DatHuynh13, Xiaocheng, Jia, Boduo, and Yanjun.
I have been reading a lot of recent work on self-evolving agents, and it really feels like the skill papers are all converging in this direction.
For training-free memory or skill self-evolution, are there any frontier papers worth reading? Has anyone already shown a hard ceiling for this line of work, meaning RL is still the only way forward?
A recent paper, "Learning, Fast and Slow: Towards LLMs That Adapt Continually," seems to take a different stance. It is not trying to settle the theory; it is basically saying, "I want both." https://t.co/wqVetHPhvd
[1/7] OPD has a simple post-training loop: sample from the student, label with the teacher, repeat.
The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there.
TRB Paper: https://t.co/EPjiHZIE3s
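The loop from the [1/7] post (sample from the student, label with the teacher, repeat) can be sketched on a toy problem. This is a minimal illustration, not the paper's method: the 3-token vocabulary, the "state = previous token" setup, the teacher table, and the function names are all assumptions made here, and the teacher signal is assumed to be a per-state reverse KL. Note that the first rollouts come from a uniform (weakest) student, which is the awkward start the post describes.

```python
import math
import random

VOCAB = 3  # toy vocabulary: tokens 0, 1, 2 (assumption for illustration)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_rollout(student_logits, length, rng):
    """Sample from the student and return the states it visited.

    The state is simply the previous token (starting from token 0).
    """
    state, visited = 0, []
    for _ in range(length):
        probs = softmax(student_logits[state])
        token = rng.choices(range(VOCAB), weights=probs)[0]
        visited.append(state)
        state = token
    return visited

def opd_step(student_logits, teacher_probs, visited, lr):
    """Label each visited state with the teacher and update the student.

    Gradient descent on KL(student || teacher) w.r.t. the student logits:
    d KL / d logit_j = p_j * (log(p_j / q_j) - KL).
    """
    for s in visited:
        p = softmax(student_logits[s])
        q = teacher_probs[s]
        kl = reverse_kl(p, q)
        grad = [p[j] * (math.log(p[j] / q[j]) - kl) for j in range(VOCAB)]
        for j in range(VOCAB):
            student_logits[s][j] -= lr * grad[j]

def train(steps=200, length=8, lr=1.0, seed=0):
    rng = random.Random(seed)
    # Hypothetical teacher: a fixed next-token table with full support.
    teacher_probs = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
    # The student starts uniform, so the earliest rollouts are the weakest.
    student_logits = [[0.0] * VOCAB for _ in range(VOCAB)]
    for _ in range(steps):
        visited = sample_rollout(student_logits, length, rng)      # sample
        opd_step(student_logits, teacher_probs, visited, lr)       # label + update
    return student_logits, teacher_probs
```

The teacher only ever scores states the student actually reached, which is what makes the loop on-policy and also why a poor initial student limits how useful the early labels are.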
Someone once told me: "You should be the last one to reinvent something" -- not sure how useful this is, but this is a common occurrence in science.
It is true that frontier AI labs have innovations that are often simultaneous / re-discovered by academic labs.
However, folks outside those labs have no way of knowing about those innovations and their only source of reference would be the work shared openly.
Speculative OPD addresses this exact issue in OPD: the student distribution can sometimes be too far from the teacher's for the teacher to provide useful feedback.
https://t.co/uca9fw9mxp
extremely informal rant: on-policy distillation is so awkward and frankly just super overrated.
why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting.
imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus.
after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.
or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1-step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!
in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.
but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation, is conceptually a much more powerful direction.
the teacher's job is to actually instruct and to diverge from the student's likely actions to the (smallest) necessary extent for success.
today @CS153Systems, the students got to hear from @LiamFedus and @ekindogus about their search for a room temperature superconductor at @periodiclabs
the kids will remember this one for the rest of their lives
My MLSys keynote on AI writing systems code got more interest than I expected. The recording will take a while, so in the finest tradition of AI labs sharing blog posts, we’re starting the Core Automation Blog with this one https://t.co/h4uSOyrglf
One of the hottest terms in AI right now is "On-policy distillation".
It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL.
Now a method on PapersWithCode!
Find all 183 papers that cite it, and more here: https://t.co/NIsUjyU3UP
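One common way to write the objective described above, assuming the teacher signal is a per-token reverse KL (other divergences are also used). The symbols are notation introduced here, not taken from the post: $\pi_\theta$ is the student, $\pi_{\mathrm{T}}$ the teacher, $\mathcal{D}$ the prompt distribution.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_{\mathrm{T}}(\cdot \mid x, y_{<t})
      \right)
    \right]
```

Sampling $y$ from $\pi_\theta$ is the on-policy part, and the sum over positions $t$ is the dense, distillation-style supervision.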
Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL.
But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.
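The post does not say what the two filtering steps are, so the sketch below only illustrates the general idea of averaging a token-level loss under a two-step mask. Both filters (a relevance threshold, then a top-fraction cut), the `relevance` scores, and all names and defaults are placeholders chosen here for illustration; this is not RMSD itself.

```python
import math

def masked_token_loss(token_losses, relevance, threshold=0.5, keep_fraction=0.5):
    """Average per-token losses over a two-step mask.

    Step 1 (assumed): drop tokens whose relevance score is below `threshold`.
    Step 2 (assumed): of the survivors, keep the top `keep_fraction` by relevance.
    Returns (mean loss over kept tokens, mask as a list of 0.0/1.0).
    """
    if len(token_losses) != len(relevance):
        raise ValueError("token_losses and relevance must have the same length")
    # Step 1: coarse filter on relevance.
    candidates = [i for i, r in enumerate(relevance) if r >= threshold]
    # Step 2: keep only the highest-scoring fraction of the survivors.
    candidates.sort(key=lambda i: relevance[i], reverse=True)
    k = max(1, math.ceil(len(candidates) * keep_fraction)) if candidates else 0
    kept = set(candidates[:k])
    mask = [1.0 if i in kept else 0.0 for i in range(len(token_losses))]
    total = sum(m * l for m, l in zip(mask, token_losses))
    return (total / k if k else 0.0), mask
```

For example, `masked_token_loss([1.0, 2.0, 3.0, 4.0], [0.9, 0.1, 0.6, 0.8])` keeps only the first and last tokens, so the loss is averaged over two tokens instead of four.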
Very well written blog. I think of RL as learning from interventions, and it kinda explains why it's more powerful as a paradigm than supervised learning.
Now, learning from counterfactuals is something we haven't been historically good at, but maybe world modelling + RL can get us there.
Perplexed by this take: Sure, let's not mainly do supervised learning on human knowledge, but it makes sense to build off it instead of *let's do it from scratch*.
People cite AlphaGo vs AlphaGo Zero as a quintessential example of how using human-generated data is suboptimal, but it was *imitating* it that was suboptimal.
What if we learned from that data assuming it was suboptimal in the first place (so not supervised learning, but an RL-like mindset for using that data)?
The bitter lesson in 26 words:
Don’t be distracted by human knowledge, as AI has been historically.
Instead focus on methods for creating knowledge that scale with computation, like search and learning.
We did research when pay was low.
We did research when pay was uncertain.
We did research even when we were lucky enough to be paid well.
One way to figure out what to work on is to work on things that matter and not think of rewards.
We are still quite early in understanding what makes a frontier model, all the way from optimization to architecture and objectives. Big token wants to convince you otherwise.