Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works.
I asked him if I could record it on my iPhone.
The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory.
So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right before the point where the mistake was made.
Now, with these hint tokens injected, have the model run a forward pass over the trajectory. You don't have to generate a new rollout, i.e. no new decode is required.
The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
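Here's a rough toy sketch of that loop, not the actual method from the lecture. Everything in it is an assumption made for illustration: `ToyPolicy` is a hypothetical stand-in for a language model (one logit per token plus a fixed shift when the hint is in context), `HINT` is made-up hint text, and the loss is a plain KL between the hinted and unhinted distributions at the single error position. A real setup would run one forward pass over the full hinted trajectory and read off the probabilities at the error tokens.

```python
import math

VOCAB = ["search", "calc", "made_up_tool", "done"]
HINT = "<hint: only `search` and `calc` exist>"  # hypothetical hint text


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two distributions over VOCAB."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class ToyPolicy:
    """Hypothetical stand-in for an LM: one logit per token, plus a fixed
    shift that applies whenever the hint appears in the context."""

    def __init__(self):
        self.logits = [1.0, 0.5, 2.0, 0.0]       # prefers the nonexistent tool
        self.hint_shift = [0.5, 0.5, -4.0, 0.0]  # hint pushes the bad tool down

    def probs(self, context):
        shift = self.hint_shift if HINT in context else [0.0] * len(VOCAB)
        return softmax([z + s for z, s in zip(self.logits, shift)])


def distill_step(policy, prefix, teacher, lr=0.5):
    """One gradient step pulling the unhinted distribution toward the target."""
    student = policy.probs(prefix)
    loss = kl(teacher, student)
    # For a fixed teacher, d KL(teacher || softmax(z)) / dz = student - teacher.
    policy.logits = [z - lr * (s - t)
                     for z, s, t in zip(policy.logits, student, teacher)]
    return loss


# The rollout with the mistake; a separate model is assumed to have located it.
rollout = ["user: what is 2 + 2?", "assistant: call", "made_up_tool"]
error_index = 2
prefix = rollout[:error_index]
bad = VOCAB.index(rollout[error_index])

policy = ToyPolicy()
# Teacher target: same model, hint inserted right before the mistake.
# This is a forward pass only (no new rollout) and is held fixed as the target.
teacher = policy.probs(prefix + [HINT])

p_before = policy.probs(prefix)[bad]
losses = [distill_step(policy, prefix, teacher) for _ in range(50)]
p_after = policy.probs(prefix)[bad]
print(f"p(error token) without hint: {p_before:.3f} -> {p_after:.3f}")
```

The point of the toy: the training signal lands only on the position where the mistake happened, instead of being smeared across the whole trajectory the way a final reward would be.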
What is mid-training?
The stage between pre-training and post-training
A base model is continued on a smaller, curated data mixture chosen to strengthen capabilities that the original pre-training run undercovered, such as multilinguality, domain knowledge, or long-context extension.
It usually keeps a pre-training-like objective, but uses higher-quality or more targeted data so later instruction tuning, preference tuning, or RL can shape behavior on top of stronger capabilities.
Learn more here: https://t.co/WhpYkyGlv8
@martin_casado there are two large, capable, and well-resourced entities with clear strategic interests in ensuring open models keep up: China and Nvidia
preventing distillation and capturing market share are in tension. it'll be hard to distill GPT-7-BioChem, easy to distill Default Claude.
@Miles_Brundage Four months actually seems like… a decent compromise between open and proprietary?
Big labs get to keep saying they have the best models at any given time, but four months is a small enough gap that it’s still very reasonable to keep investing in open weights as well
yeah no it’s good work to be sure, but I want to think soberly about the difference between optimizing a specific training pipeline vs. an order-of-magnitude performance leap in general purpose tooling
the former is good engineering work (that I assume all the big labs are doing to some extent), the latter would be a breakthrough… but I don’t think this is the latter
The implied win here is “our C framework is 10x faster than JAX.”
Surely the actual win is good kernel work optimized for their specific cluster configuration?
Perf folks: what would JAX/XLA fundamentally fail to express here?
SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
The potential speed improvement vs JAX for large training runs is over an order of magnitude.
@maharshii Is it really a general framework tho? or just a specific training pipeline with good kernels optimized for a specific cluster configuration?