Alexandre Ramé

@ramealexandre

Senior research scientist @GoogleDeepMind. Previously PhD @Sorbonne_Univ_. Post-training Gemma LLMs: distillation, RL and merging.

Joined May 2011

782 Following

2K Followers

883 Posts

Pinned Tweet

Alexandre Ramé @ramealexandre

about 1 year ago

Welcome Gemma 3, our new open-weight LLM from @GoogleDeepMind. All sizes (1B, 4B, 12B and 27B) excel on benchmarks, but the key result may be the 27B reaching 1338 on LMSYS. For this, we scaled post-training, with our novel distillation, RL and merging strategies. Happy building!

5

200

24

25

21K

ramealexandre retweeted

Oleksii Kuchaiev

about 6 hours ago

Our post-training pipeline is a substantial redesign from Super. The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across a huge mix of agentic/reasoning/code/safety environments, then Multi-teacher On-Policy Distillation (MOPD). 10+ domain-specialized teachers, merged into the student via dense token-level guidance on its own rollouts. See Figures below for overview and tech report for all the details. 2/4

3

165

18

125

44K

ramealexandre retweeted

Andreas Steiner @AndreasPSteiner

1 day ago

Gemma 4 12B in action: Object detection, function calling, voice command, segmentation, language switch, translation - all of this and much more without vision/audio encoders! (Inputs and outputs are real, but FC2 data shown as code, and generation speedified)

2

45

8

17

9K

ramealexandre retweeted

1 day ago

Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

354

11K

2K

5K

3M

ramealexandre retweeted

Kunhao Zheng @KunhaoZ

7 days ago

🧵 For 2 RL checkpoints trained differently, you can just weight extrapolate them and it works!

Bonus: these extrapolated checkpoints are complementary policies -> Get exploration and diversity for free -> Better inference scaling when ensembling

Paper: https://t.co/zU0LH0TOdm
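As a rough illustration of what extrapolating two checkpoints in weight space can look like, the sketch below assumes one common linear form, θ = θ_a + α(θ_a − θ_b) with α > 0. The linked paper may use a different formula; the function name `extrapolate`, the coefficient `alpha`, and the toy weights are all hypothetical.

```python
from typing import Dict, List

# A checkpoint is modelled here as parameter name -> flat list of floats
# (an assumption made to keep the sketch dependency-free).
Weights = Dict[str, List[float]]


def extrapolate(theta_a: Weights, theta_b: Weights, alpha: float) -> Weights:
    """Move past theta_a along the direction (theta_a - theta_b).

    alpha = 0 returns theta_a, alpha = -1 returns theta_b,
    and alpha > 0 extrapolates beyond theta_a.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share parameter names")
    return {
        name: [a + alpha * (a - b) for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


# Toy checkpoints (made-up values, not real model weights).
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [0.0, 4.0]}
extrapolated = extrapolate(ckpt_a, ckpt_b, alpha=0.5)
print(extrapolated)
```

Swapping the roles of the two checkpoints gives a second extrapolated policy, which is one way to picture the "complementary policies" mentioned in the thread.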

3

121

29

76

13K

ramealexandre retweeted

12 days ago

I think everyone basically agrees you need some sort of filtering (exactly what is the question) over soft target distillation. The problem is that token space is the wrong domain imo.

Look at the sample below, even as a human I cannot tell you which tokens deserve to be upweighted (What do you mean we need to upweight "P" and "Ban"). The behaviour change we want lies several levels implicitly below token space. How do we expect an LLM judge to be able to filter?

The current prompt literally provides the divergent token index and the respective log probs. Even if you gave me that information, I won't even know what to do with it.

I don't know the solution but it's definitely not token space credit assignment or masking imo

0

25

2

11

6K

Alexandre Ramé @ramealexandre

13 days ago

@bellmantd @RyanBoldi Thanks for the highlight, but this seems quite different! Notably in rewarded soups we have multiple policies (1 for each reward), so it's more costly, but perhaps more flexible at deployment time :)
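For readers unfamiliar with rewarded soups, the idea referred to here is linear interpolation in weight space between policies that were each fine-tuned on a different reward from a shared initialization, with the interpolation coefficients chosen at deployment time. A minimal sketch, with hypothetical function names, reward names and toy weights:

```python
from typing import Dict, List, Sequence

# A policy is modelled as parameter name -> flat list of floats (an assumption
# made to keep the sketch dependency-free).
Weights = Dict[str, List[float]]


def soup(policies: Sequence[Weights], lambdas: Sequence[float]) -> Weights:
    """Linearly interpolate several policies with coefficients that sum to 1."""
    if len(policies) != len(lambdas):
        raise ValueError("need exactly one coefficient per policy")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    first = policies[0]
    return {
        name: [
            sum(lam * policy[name][i] for lam, policy in zip(lambdas, policies))
            for i in range(len(first[name]))
        ]
        for name in first
    }


# One policy per reward (toy values; the reward names are illustrative only).
policy_reward_1 = {"w": [1.0, 0.0]}
policy_reward_2 = {"w": [0.0, 1.0]}

# The trade-off between the two rewards is picked at deployment time.
merged = soup([policy_reward_1, policy_reward_2], lambdas=[0.25, 0.75])
print(merged)
```

The cost noted in the reply shows up directly: every reward needs its own fine-tuned policy before any interpolation can happen.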

0

6

1

0

197

ramealexandre retweeted

Logan Thorneloe

@loganthorneloe

14 days ago

This is a great and short watch for anyone curious about post-training. Very well explained and easy to understand.

0

10

2

5

2K

ramealexandre retweeted

15 days ago

Anti-Self-Distillation for Reasoning RL

Invert the divergence.

Preserving deliberation tokens like "Wait" and "Maybe" instead of template parroting leads to 2-10x faster convergence and +11.5 points on AIME/HMMT across 4B-30B models.

1

68

18

52

5K

ramealexandre retweeted

17 days ago

Thank you for sharing RLRT! @ar0cket1

RLRT takes different approaches to correct and incorrect rollouts:
- Correct rollouts: reverse OPSD
- Incorrect rollouts: GRPO

However, as introduced, even full reverse OPSD shows faster performance improvements in the early stages, although it becomes unstable later on. Interesting!

0

23

3

19

3K

ramealexandre retweeted

fly51fly @fly51fly

22 days ago

[CL] A Study on Hidden Layer Distillation for Large Language Model Pre-Training
M Guigon, L Dixon, M E. Sander [Google DeepMind] (2026)
https://t.co/38OTfJgcce

0

35

6

34

3K

ramealexandre retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

24 days ago

> Gemma 4 is a high-level chess grandmaster
> GPT 5.5 pro is close to Stockfish level 7

28

380

10

65

61K

ramealexandre retweeted

Cas (Stephen Casper)

@StephenLCasper

27 days ago

I'm definitely against this specific method. This particular method has no principled way whatsoever of guaranteeing that the English text inside the autoencoder reliably and faithfully corresponds to what the model is "actually thinking." Optimizing their encoder and decoder jointly for reconstruction does not put any optimization pressure on anything related to making sure that the intermediate text between the encoder and decoder has the same meaning to the decoder as its meaning in English. The fact that they had to use a KL divergence penalty calculated using a text model to make sure that the English intermediates were readable is pretty strong evidence that their method is fundamentally ill-equipped to produce faithful explanations. This issue with this autoencoder approach is not something that you could solve by modifying the method. It would be a category error to have optimism about natural language autoencoders being reliably able to produce faithful explanations.

1

16

2

4

1K

ramealexandre retweeted

27 days ago

I think automatic interp has enormous potential, but it is still immature in its current forms (and could give false illusions that we magically solved interp for those who haven't played with such tools). Relevant criticism on AO https://t.co/6xNlsIkutg https://t.co/vZI8OaPYUA

0

54

4

34

7K

ramealexandre retweeted

about 1 month ago

We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias. But theory is only half the story. It turns out that optimizing the population risk of any neural network amounts to a small change to your optimizer. 🧵

21

996

127

1K

75K

Alexandre Ramé @ramealexandre

about 1 month ago

@natolambert disteallation ?

0

1

0

0

314

ramealexandre retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

about 1 month ago

I guess another reason DeepSeek goes to such lengths for multi-teacher OPD is it's substantially more natural to RLmaxx tasks with multiple objectives (correctness, format, CoT faithfulness, increasing length penalty) in a narrow domain, than just GRPO-on-everything.

1

33

2

12

5K

ramealexandre retweeted

Cameron R. Wolfe, Ph.D.

@cwolferesearch

about 1 month ago

Gemma-4 has received more attention than prior generations of Gemma, but I still somehow feel like these models are so underrated. I don't think people fully realize how good Gemma-4 models are, especially for their size.

27

169

6

22

15K

ramealexandre retweeted

Cameron R. Wolfe, Ph.D.

@cwolferesearch

about 1 month ago

Recently read a great overview of multi-teacher on-policy distillation (MOPD) and how it was used in recent LLMs like MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2, and DeepSeek-V4…

What is on-policy distillation (OPD)? The idea of OPD is simple. We have a student and a teacher. We sample trajectories from the student, then use reverse KL divergence as an objective to match the teacher’s log probability distribution along these trajectories. This training setup can be integrated into the GRPO loss by replacing group-relative advantage with reverse KL.

Multi-teacher OPD (MOPD) extends this idea by having more than one teacher during OPD training. This idea is useful due to the domain-specific nature of RLVR training. If we train a model on math with RLVR, this may improve math performance but harm model quality on creative tasks. Similarly, RL training a model on tool use data could degrade performance on general-purpose benchmarks. To solve this see-saw problem, we can train domain-specific models with RL and use MOPD to distill them into a single student.

Post-training with MOPD has become a common choice for recent models:

- MiMo-V2-Flash: starts from a general SFT model and uses domain-specific models from across the post-training pipeline (i.e., SFT models, RL specialists, and the student itself) as teachers for MOPD at the final stage of post-training, where teachers are selected heuristically by domain.
- GLM-5: starts from the final RL checkpoint produced by a sequential RL pipeline of reasoning, agentic, and general domains, using the final checkpoint of each stage as a teacher, where the teacher is again selected by the domain of the prompt. MOPD here aims to recover capabilities rather than merging them across domains.
- Nemotron-Cascade 2: places MOPD at the mid-point of post-training as a stabilization step between stages. Three prior model checkpoints are chosen as teachers from prior training stages for math, RLHF, and multi-domain RL.
- DeepSeek-V4: trains a very large number (10+) of domain experts independently using domain-specific SFT and RL, then distills all of them into a single student. This paper interestingly uses full vocabulary distillation, which has very high memory overhead and is infrastructurally complex, instead of approximating KL with a single logit.

The blog also includes a great snippet about why self-distillation is a useful addition in MOPD: “Self is a snapshot of the student at the start of MOPD—a fixed, stable reference distribution. On tokens where the SFT/RL teachers push the student into unfamiliar territory, distilling toward Self prevents catastrophic drift.”

Despite MOPD becoming quite common in several different reports, the approaches used are all quite similar (i.e., reverse KL, on-policy distillation, multiple teachers), indicating that OPD / MOPD is becoming a more standardized approach in training pipelines for recent models.
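The OPD recipe described in this post can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the models named here: the function names (`reverse_kl_per_token`, `mopd_token_rewards`), the domain-to-teacher routing dictionary, and the toy log-probabilities are all assumptions. It uses the single-sample estimate of reverse KL, log π_student(token) − log π_teacher(token), evaluated on tokens the student itself sampled, and negates it so that it can stand in where a GRPO-style advantage would otherwise go.

```python
from typing import Dict, List


def reverse_kl_per_token(student_logps: List[float],
                         teacher_logps: List[float]) -> List[float]:
    """Single-sample reverse-KL estimate for each token of a student rollout.

    Both lists hold log-probabilities of the tokens the *student* sampled,
    scored by the student and by the teacher respectively.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("log-prob lists must have the same length")
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def mopd_token_rewards(domain: str,
                       student_logps: List[float],
                       teacher_logps_by_domain: Dict[str, List[float]]) -> List[float]:
    """Route to a teacher by prompt domain (hypothetical routing rule) and
    return the negative reverse-KL estimate as a dense per-token reward."""
    teacher_logps = teacher_logps_by_domain[domain]
    kl = reverse_kl_per_token(student_logps, teacher_logps)
    # Written as 0.0 - k (rather than -k) only to avoid printing "-0.0".
    return [0.0 - k for k in kl]


# Toy rollout of three tokens (made-up log-probabilities).
student = [-1.0, -0.5, -2.0]
teachers = {
    "math": [-0.5, -0.5, -3.0],  # hypothetical math specialist
    "code": [-1.0, -1.0, -1.0],  # hypothetical code specialist
}
rewards = mopd_token_rewards("math", student, teachers)
print(rewards)
```

A token gets a positive reward when the selected teacher finds it more likely than the student does, and a negative one when the teacher finds it less likely. Full-vocabulary distillation, as attributed to DeepSeek-V4 in the post, would instead compare the whole next-token distributions at each position.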

2

245

35

214

23K

ramealexandre retweeted

Marius Hobbhahn

@MariusHobbhahn

about 1 month ago

I think this is bad because it makes it much harder to track misalignment, especially deceptive alignment.

7

153

12

43

25K

ramealexandre retweeted

Alexis Limozin @AlexisLimozin

about 1 month ago

We found two bugs in DeepSpeed and OpenRLHF that deflated SFT baselines in several recent mixed-policy RL papers. Once fixed, plain SFT-then-RL beats or matches every mixed-policy method we tested.

4

56

11

27

22K
