Welcome Gemma 3, our new open-weight LLM from @GoogleDeepMind. All sizes (1B, 4B, 12B and 27B) excel on benchmarks, but the key result may be the 27B reaching 1338 on LMSYS. For this, we scaled post-training, with our novel distillation, RL and merging strategies. Happy building!
Our post-training pipeline is a substantial redesign from Super.
The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across a huge mix of agentic/reasoning/code/safety environments, then Multi-teacher On-Policy Distillation (MOPD). 10+ domain-specialized teachers, merged into the student via dense token-level guidance on its own rollouts. See Figures below for overview and tech report for all the details. 2/4
Gemma 4 12B in action: Object detection, function calling, voice command, segmentation, language switch, translation - all of this and much more without vision/audio encoders!
(Inputs and outputs are real, but FC2 data shown as code, and generation speedified)
Meet Gemma 4 12B!
A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.
Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
🧵 For 2 RL checkpoints trained differently, you can just weight extrapolate them and it works!
Bonus: these extrapolated checkpoints are complementary policies
-> Get exploration and diversity for free
-> Better inference scaling when ensembling
Paper: https://t.co/zU0LH0TOdm
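A minimal sketch of what "weight extrapolating" two checkpoints could look like, assuming plain linear extrapolation in parameter space. The thread does not give the exact formula, so the function name `extrapolate`, the coefficient `alpha`, and the toy checkpoints are all illustrative assumptions, not the paper's method.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flat list of weights (toy format)


def extrapolate(theta_a: Checkpoint, theta_b: Checkpoint, alpha: float) -> Checkpoint:
    """Return theta_a + alpha * (theta_a - theta_b), parameter by parameter.

    alpha = 0 gives theta_a back, alpha = -1 gives theta_b, values in between
    interpolate, and alpha > 0 moves past theta_a, away from theta_b.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [a + alpha * (a - b) for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


# Toy checkpoints standing in for two RL runs of the same architecture.
ckpt_a = {"w": [1.0, 2.0]}
ckpt_b = {"w": [0.0, 4.0]}

beyond_a = extrapolate(ckpt_a, ckpt_b, alpha=0.5)  # pushed past A
beyond_b = extrapolate(ckpt_b, ckpt_a, alpha=0.5)  # pushed past B
```

Extrapolating in both directions yields two distinct sets of weights from the same pair of checkpoints. Whether these correspond to the "complementary policies" mentioned above depends on details in the paper.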
I think everyone basically agrees you need some sort of filtering (exactly what is the question) over soft target distillation. The problem is that token space is the wrong domain imo.
Look at the sample below: even as a human, I cannot tell you which tokens deserve to be upweighted (what do you mean we need to upweight "P" and "Ban"?). The behaviour change we want lies several levels implicitly below token space. How do we expect an LLM judge to be able to filter?
The current prompt literally provides the divergent token index and the respective log probs. Even if you gave me that information, I wouldn't know what to do with it.
I don't know the solution, but it's definitely not token-space credit assignment or masking imo.
@bellmantd @RyanBoldi Thanks for the highlight, but this seems quite different! Notably in rewarded soups we have multiple policies (1 for each reward), so it's more costly, but perhaps more flexible at deployment time :)
Anti-Self-Distillation for Reasoning RL
Invert the divergence.
Preserving deliberation tokens like "Wait" and "Maybe" instead of template parroting leads to 2-10x faster convergence and +11.5 points on AIME/HMMT across 4B-30B models.
Thank you for sharing RLRT! @ar0cket1
RLRT takes different approaches to correct and incorrect rollouts:
- Correct rollouts: reverse OPSD
- Incorrect rollouts: GRPO
However, as the paper notes, even full reverse OPSD shows faster performance improvements in the early stages, although it becomes unstable later on. Interesting!
[CL] A Study on Hidden Layer Distillation for Large Language Model Pre-Training
M Guigon, L Dixon, M E. Sander [Google DeepMind] (2026)
https://t.co/38OTfJgcce
I'm definitely against this specific method. This particular method has no principled way whatsoever of guaranteeing that the English text inside the autoencoder reliably and faithfully corresponds to what the model is "actually thinking." Optimizing their encoder and decoder jointly for reconstruction does not put any optimization pressure on anything related to making sure that the intermediate text between the encoder and decoder has the same meaning to the decoder as its meaning in English. The fact that they had to use a KL divergence penalty calculated using a text model to make sure that the English intermediates were readable is pretty strong evidence that their method is fundamentally ill-equipped to produce faithful explanations. This issue with this autoencoder approach is not something that you could solve by modifying the method. It would be a category error to have optimism about natural language autoencoders being reliably able to produce faithful explanations.
I think automatic interp has enormous potential, but it is still immature in its current forms (and could give those who haven't played with such tools the false illusion that we magically solved interp).
Relevant criticism on AO
https://t.co/6xNlsIkutg
https://t.co/vZI8OaPYUA
We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias.
But theory is only half the story. It turns out that optimizing the population risk of any neural network amounts to a small change to your optimizer. 🧵
I guess another reason DeepSeek goes to such lengths for multi-teacher OPD is that it's substantially more natural to RLmaxx tasks with multiple objectives (correctness, format, CoT faithfulness, increasing length penalty) in a narrow domain than to just do GRPO-on-everything.
Gemma-4 has received more attention than prior generations of Gemma, but I still somehow feel like these models are so underrated. I don't think people fully realize how good Gemma-4 models are, especially for their size.
Recently read a great overview of multi-teacher on-policy distillation (MOPD) and how it was used in recent LLMs like MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2, and DeepSeek-V4…
What is on-policy distillation (OPD)? The idea of OPD is simple. We have a student and a teacher. We sample trajectories from the student, then use reverse KL divergence as an objective to match the teacher’s log probability distribution along these trajectories. This training setup can be integrated into the GRPO loss by replacing the group-relative advantage with a per-token reverse KL term (negated, so that lower divergence from the teacher is rewarded).
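To make the objective concrete, here is a small sketch of reverse KL at one token position, plus a per-token advantage built from the sampled tokens only. It assumes the advantage is the negative of the single-sample reverse-KL estimate (teacher log-prob minus student log-prob), which is one common reading of the setup above and not necessarily what any specific report implements. The function names are illustrative.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl_full(student_logits: List[float], teacher_logits: List[float]) -> float:
    """Exact KL(student || teacher) over the full vocabulary at one position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def opd_token_advantages(student_logps: List[float], teacher_logps: List[float]) -> List[float]:
    """Per-token advantage from the sampled tokens only.

    Inputs are log-probs of the tokens the student actually sampled, scored by
    the student and by the teacher. log p_student - log p_teacher is a
    single-sample estimate of reverse KL; the advantage is its negative
    (assumption: this is how the term enters a GRPO-style loss).
    """
    return [lt - ls for ls, lt in zip(student_logps, teacher_logps)]


# Toy rollout of two tokens.
adv = opd_token_advantages(student_logps=[-1.0, -0.5], teacher_logps=[-0.5, -2.0])
```

A positive advantage marks a token the teacher found more likely than the student did, and a negative one marks a token the teacher would rather the student had avoided. The full-vocabulary version is exact but needs the whole logit vector from both models at every position.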
Multi-teacher OPD (MOPD) extends this idea by having more than one teacher during OPD training. This idea is useful due to the domain-specific nature of RLVR training. If we train a model on math with RLVR, this may improve math performance but harm model quality on creative tasks. Similarly, RL training a model on tool use data could degrade performance on general-purpose benchmarks. To solve this see-saw problem, we can train domain-specific models with RL and use MOPD to distill them into a single student.
Post-training with MOPD has become a common choice for recent models:
- MiMo-V2-Flash: starts from a general SFT model and uses domain-specific models from across the post-training pipeline (i.e., SFT models, RL specialists, and the student itself) as teachers for MOPD at the final stage of post-training, where teachers are selected heuristically by domain.
- GLM-5: starts from the final RL checkpoint produced by a sequential RL pipeline of reasoning, agentic, and general domains, using the final checkpoint of each stage as a teacher, where the teacher is again selected by the domain of the prompt. MOPD here aims to recover capabilities rather than merging them across domains.
- Nemotron-Cascade 2: places MOPD at the mid-point of post-training as a stabilization step between stages. Three checkpoints from earlier training stages (math, RLHF, and multi-domain RL) are chosen as teachers.
- DeepSeek-V4: trains a very large number (10+) of domain experts independently using domain-specific SFT and RL, then distills all of them into a single student. This paper interestingly uses full vocabulary distillation, which has very high memory overhead and is infrastructurally complex, instead of approximating KL with a single logit.
The blog also includes a great snippet about why self-distillation is a useful addition in MOPD: “Self is a snapshot of the student at the start of MOPD—a fixed, stable reference distribution. On tokens where the SFT/RL teachers push the student into unfamiliar territory, distilling toward Self prevents catastrophic drift.”
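The per-domain teacher selection from the list above, together with the Self snapshot, could be sketched roughly as follows. This assumes each teacher is a callable that scores the student's sampled tokens, and that prompts without a matching specialist fall back to Self. The `Scorer` interface, the `mopd_advantages` helper, and the fallback rule are hypothetical and not taken from any of the reports.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a teacher maps the student's sampled tokens to one
# log-probability per token.
Scorer = Callable[[List[str]], List[float]]


def mopd_advantages(
    tokens: List[str],
    student_logps: List[float],
    domain: str,
    teachers: Dict[str, Scorer],
    fallback: str = "self",
) -> List[float]:
    """Pick the teacher by prompt domain and return per-token advantages.

    The advantage is log p_teacher - log p_student on the sampled tokens,
    i.e. the negative single-sample reverse-KL estimate (an assumption).
    """
    teacher = teachers.get(domain, teachers[fallback])
    teacher_logps = teacher(tokens)
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student must score the same tokens")
    return [lt - ls for ls, lt in zip(student_logps, teacher_logps)]


# Toy example: one math specialist plus a frozen Self snapshot. At the start
# of MOPD the snapshot agrees with the student, so its advantages are zero.
tokens = ["2", "+", "2"]
student_logps = [-1.0, -0.5, -2.0]
teachers: Dict[str, Scorer] = {
    "math": lambda toks: [-0.5, -0.5, -1.0],
    "self": lambda toks: list(student_logps),
}

math_adv = mopd_advantages(tokens, student_logps, "math", teachers)
other_adv = mopd_advantages(tokens, student_logps, "poetry", teachers)
```

As the student drifts during training, the frozen Self scores stop matching the student's own, and the Self term starts pulling tokens back toward the starting distribution.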
Despite MOPD becoming quite common in several different reports, the approaches used are all quite similar (i.e., reverse KL, on-policy distillation, multiple teachers), indicating that OPD / MOPD is becoming a more standardized approach in training pipelines for recent models.
We found two bugs in DeepSpeed and OpenRLHF that deflated SFT baselines in several recent mixed-policy RL papers.
Once fixed, plain SFT-then-RL beats or matches every mixed-policy method we tested.