Great blog post on "Taxonomy of Principal Distances & Divergences" by Hamidreza Hashempoor from Institute for AI, University of Stuttgart.
Worth checking out!
https://t.co/ExabfkR2H1
Incredible how Z.ai literally has their RL infrastructure open source.
The entire OPD post-training of GLM-5.2 on this slime platform took ~2 days.
https://t.co/XVjW6rGcbg
"Transformers" by Daniel Jurafsky and James H. Martin is one of the clearest and most mathematically grounded introductions to the Transformer architecture I have ever read.
Chapter 8 introduces the Transformer as the standard architecture behind modern large language models. What makes this chapter particularly interesting is its step-by-step presentation of the underlying mechanisms: contextual embeddings, self-attention, query, key and value vectors, scaled dot-product attention, multi-head attention, residual streams, feedforward layers, layer normalization, masking, and the parallel matrix formulation of attention.
In particular, the treatment of attention as a weighted sum of contextual representations is especially valuable. The chapter first develops an intuitive, simplified view of attention and then gradually derives the full formulation using the Q, K, and V matrices. This approach makes it easier to understand what is actually happening inside the architecture from an algebraic and matrix-based perspective, rather than simply viewing the usual block diagrams.
I think it is an excellent resource for anyone interested in understanding how Transformers work from linguistic, mathematical, and computational perspectives.
https://t.co/3fitdPy6Fv
Our CEO Tara Murphy Dougherty just wrapped a fireside chat with @demarest_colin of @axios at #Reindustrialize2026.
The focus? Closing the nearly $700M/day "Readiness Gap." We appreciated Colin’s questions on how our launch of Enterprise Readiness—and our evolution from Govini into Air—closes this gap.
By moving beyond static, legacy data silos into an AI-native execution platform, we can finally coordinate suppliers, operators, and federal agencies simultaneously.
#EnterpriseReadiness #DefenseTech #NationalSecurity #MilitaryLogistics #AI
Big news from #Reindustrialize2026 today: Govini is now Air. We've also introduced Enterprise Readiness, an AI-native architecture to bridge the "Readiness Gap."
Despite the U.S. spending nearly $700M a day on readiness, defense supply chains and sustainment models are caught in a quagmire of fragmented data and manual processes. The frontline needs to move in seconds; yet the enterprise still responds in years.
That’s the "Readiness Gap" we are here to close. To help bridge this gap, we are doing two things today:
• Launching Enterprise Readiness: An AI-native architecture designed to continuously coordinate development, production, sustainment, and delivery across the defense industrial ecosystem.
• Changing our corporate name: Govini is now Air. The new name underscores the company’s evolution from a defense acquisition software pioneer to the creator of a continuously coordinated execution platform for national readiness.
The architecture behind our weapons is finally becoming worthy of them. Welcome to Air.
#EnterpriseReadiness #DefenseTech #AI #NationalSecurity #GoviniIsNowAir
This is the best site on the internet to learn harness engineering.
Free. Completely.
Most AI engineers have never heard the term.
https://t.co/bwDbTTYsjM
Bookmark this site.
Then read this setup ↓
RECORD!
102 kilometers: the distance at which your quadcopter-type FPV hit a target without using a mothership drone.
We go deeper from here!
Donate to innovation!
Donate to the Rusoriz!
Evaluations should not be static. We need to evolve evaluation sets / benchmarks over time so that they remain relevant and unsaturated.
There are three main ways we can refine our evals to make them better:
- Difficulty-based refinement: curating more difficult tasks or data to use for evaluation within a benchmark.
- Quality-based refinement: identifying and fixing issues in the benchmark (e.g., mislabeled data, vague or unrealistic questions, poor format, etc.).
- Diversity-based refinement: expanding the scope of questions and topics covered by a particular benchmark.
There are many ways to accomplish this, but here are a few concrete examples…
MMLU-Pro extends MMLU by making it more accurate, difficult and discriminative. Easy questions are removed by using model-based difficulty filtering, where we take a pool of eight models and remove questions that the majority of models get correct. More difficult questions are sourced from a variety of public datasets. All new and remaining questions undergo an extensive quality audit using a combination of human and LLM oversight.
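A minimal sketch of the model-based difficulty filter described above, assuming we already have a per-question list of correctness flags, one per model in the pool. The function name, the toy data, and the exact tie rule (a question is removed only when strictly more than half of the models answer it correctly) are illustrative assumptions, not details taken from the MMLU-Pro paper.

```python
from typing import Dict, List


def filter_easy_questions(results: Dict[str, List[bool]]) -> List[str]:
    """Keep question ids that a majority of models did NOT answer correctly."""
    kept = []
    for qid, per_model_correct in results.items():
        n_correct = sum(per_model_correct)
        # Assumed rule: "majority" means strictly more than half of the pool.
        if n_correct <= len(per_model_correct) / 2:
            kept.append(qid)
    return kept


# Hypothetical correctness flags for a pool of eight models.
toy_results = {
    "q1": [True] * 7 + [False],      # 7 of 8 correct: too easy, removed
    "q2": [True] * 4 + [False] * 4,  # exactly half correct: kept
    "q3": [False] * 8,               # nobody correct: kept
}
kept_ids = filter_easy_questions(toy_results)
```

In practice the surviving questions would still go through the quality audit mentioned above, since a question every model fails may simply be mislabeled.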
MMLU-Redux takes a different approach of sampling ~100 questions per MMLU category and performing an extensive human quality audit. All questions are categorized into a pre-defined error taxonomy and modified by humans to form a more accurate benchmark. Around 7% of MMLU questions are found to contain errors, but the ratio varies by category.
BIG-Bench Extra Hard is constructed by replacing each task in BIG-Bench Hard with a corresponding task that tests a similar category of reasoning capabilities but is significantly more difficult. Tasks are sourced from a variety of existing reasoning benchmarks and manually chosen according to their topic and difficulty. Model-based filtering (i.e., testing a few models on tasks to see where they fail) is also used to inform the selection process. Benchmark authors prioritize longer problems that cannot be solved by cheating or random guessing.
RealMath and MathArena are both continually evolving math benchmarks. RealMath automatically updates with new problems derived from newly-published research papers and discussion forums. MathArena evaluates LLMs on math competition problems only within a short time window after their release to avoid contamination risk and updates frequently with new problems that become available.
DatBench refines a wide variety of benchmarks for vision language models (VLMs) using a combination of data filtering / selection techniques:
- Converting multiple choice to generative-style questions.
- Removing questions that can be solved with no vision info.
- Performing model-based quality filtering to find questions with quality issues that are then further filtered by a more powerful model.
- Selecting the most discriminative examples (i.e., those that differentiate between the performance of different models) using item-response theory.
A new and possibly controversial perspective:
In this video, I explain the sense in which generative AI trained by supervised learning is incapable of making novel discoveries.
https://t.co/zin5QbbT9N
The text of the speech:
AI Creativity and Discovery
Good day ladies and gentlemen. I regret that I am unable to be with you all today to engage in a back-and-forth discussion, but I am nevertheless pleased to be able to share with you, via this recording, some high-level thoughts about the current and future state of artificial intelligence, and in particular about AI’s relationship to science and mathematics, which is, as I understand it, the central focus of this meeting and of the SAIR Foundation.
I would like to start with an old joke; I am sure you have heard it before. It is the one about the researcher whose work is being evaluated, and the review comes back, and says “This work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good.”
My first point about AI is that this assessment applies exactly to large parts of AI as we know it today. Not all of today’s AI, but a large part of it: pretty much all of what we mean by “Generative AI”, which includes large language models, image and video models, and even the new methods for learning world models. All of these AIs take large numbers of examples and produce a “model” which behaves similarly to the examples, that is, which generates text like people, or images like artists or nature, and videos like we find on the internet. Don’t get me wrong, Generative AI can be extremely useful. No doubt about that. But the assessment of the joke still applies. These systems can produce output that is both novel and good, but not at the same time.
In many ways this is just absolutely not a problem. When we ask an AI for an answer from the internet, or to summarize a document, we don’t want it to be novel. We are happy if the quality of the answer, the goodness, comes from the source material—from the people who wrote the document or the articles on the internet. If the AI’s answer is novel it means it is going beyond the source material, adding something beyond it. This is what we call “hallucinations”. In most cases, we don’t like it when the AI makes something up, when it adds something novel.
One exception, of course, is when we are looking not for facts or reality, but for fiction and entertainment. We might ask for a bedtime story for a child, or an image based on existing images on the internet but which is nevertheless different and distinct from them. In these cases, it is never easy for us to know how creative the AI is actually being, as we do not know how close the AI’s story, poem, or image is to the source material. In a real practical sense we can not know this because the internet is too big, the possible sources that the AI may draw upon are too numerous.
When we ask for a fiction or novelty, the AI can give it to us because its processing is in part stochastic. Every decision can go multiple ways and will go different ways and produce a different trajectory every time. The trajectory can be random—and thus novel—or it can be based on the training data—and thus “good” because the training data is good, sourced from people or reality. Thus, the trajectory is either novel or good—based on randomness or based on data—but never both at the same time.
Really, I think it is okay if the output of Generative AI is never good and novel at the same time. For the researcher in the joke this is a devastating criticism, but for most things it is not, and for Generative AI it is not. Generative AI is meant to be a mimic. This is what supervised learning is for. Generative AI can be extremely useful, even when it just mimics, if it is faster, or cheaper, or smaller, or more customizable, or more copy-able, than the thing being mimicked. It is okay if Generative AI cannot be both novel and good at the same time. It is still a transformative technology.
But it is a limitation. And remember we are here to use AI for science and mathematics, and for these areas the assessment of the reviewer in the joke is devastating. For these areas we need true creativity and discovery. Generative AI, or Mimicking AI, will never get us there. For these we need something more, and indeed we have something more in other parts of AI. We have many AI systems which can give us more. We have AlphaGo with its world-changing move 37, or AlphaZero with its brilliant original chess-playing style. We have GT-Sophy that drives simulated racecars better than any human. We have AlphaFold and AlphaProof and Claude-Code, which have brought true advances in science, mathematics, and programming. We have RL-Lyft which optimizes the assignment of cars to passengers in the ride-hailing business. All these systems have found things that are both novel and good. And, truth be told, some language models have been augmented in ways that make them more than Generative AI based on supervised learning.
All these systems have some additional features that make them capable of true creativity and true discovery. It is important for us to recognize what this is—and that it is not present in ordinary, garden-variety Generative AI. It is something that can not come from just supervised learning, from learning from examples. What is it? Well, it is a simple thing, a commonsense thing. It is not new. We have many names for it, but unfortunately none of them are very good names. I will call it Discovery. Basically, Discovery is just the idea of trying many things and seeing which of them work, then keeping those that worked the best. Evolution by natural selection works this way. The scientific method works this way. And just ordinary life and learning works this way. We try things and remember what works. What could be more obvious? In this behavioral case, psychology has two names for it— “instrumental learning” and “operant conditioning”—and in machine learning it is what we mean by “reinforcement learning”. We also see the idea of Discovery in planning and combinatorial search—anything that involves the idea of “generate and test”.
The essence of Discovery is to combine three steps:
1. Variation,
2. Evaluation, and
3. Selective retention.
Of course, I am not the first to say this. I am not the first to point out that this combination of steps is key to science, to evolution by natural selection, and to animal behavior. I think particularly of papers by Donald Campbell, by Daniel Dennett, and by Gary Cziko. What is new in my remarks is to directly relate the idea of Discovery to modern AI to help us see that it is not present in supervised learning or Generative AI—in particular, that Discovery is not present in backpropagation or gradient descent.
Let me say explicitly what is missing from Generative AI. As we have remarked, these systems do have a stochastic aspect, so they do generate a variety of trajectories and behavior. What is missing is the Evaluation step. The generator was pre-trained by supervised learning, leaving no way at runtime to Evaluate what it generates. And of course without Evaluation there can be no Selective retention, and thus no Discovery. The variation can bring novelty, but without evaluation there is no Discovery, and arguably, no creativity. That is, I would say that creativity requires that the new things generated be Evaluated. Without evaluation, and retention of the best, there is nothing created. The novelty flickers into existence but, if its value is unrecognized, it flickers away and is lost.
In many cases, Evaluation is done by people to make a discovery. As when we have Generative AI make many pictures for us, and then we pick the one that we like the best. The human+AI system completes the discovery.
In many other cases, the Evaluation comes from a clear objective. Some moves lead to checkmate, some steps lead to a proof, some actions result in high reward, some genotypes make more copies, some theories explain the data better.
Some prefer the Variation step to be called Blind variation, where “blind” here means that it is uninformed, a shot in the dark. It does not need to be completely uninformed; a good scientist does not select theories to test at random. But neither can it be completely informed and determined. There must be some uncertainty about where the answer lies in order for there to be a discovery. In practice, the variation is partly informed and partly blind, but it is the blind part that corresponds to the discovery.
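As an illustration only, the three steps can be sketched as a tiny generate-and-test loop. The objective function, step size, and seed below are assumptions chosen for the example; the speech does not prescribe any particular algorithm.

```python
import random


def evaluate(x: float) -> float:
    """Toy objective (an assumption): higher is better, with a peak at x = 3."""
    return -(x - 3.0) ** 2


def discover(start: float, steps: int = 200, step_size: float = 0.5, seed: int = 0) -> float:
    """Try many things, see which work, and keep the best one found so far."""
    rng = random.Random(seed)
    best = start
    for _ in range(steps):
        # 1. Variation: partly informed (centered on the current best), partly blind (random).
        candidate = best + rng.gauss(0.0, step_size)
        # 2. Evaluation: score the candidate against an explicit objective.
        if evaluate(candidate) > evaluate(best):
            # 3. Selective retention: keep the candidate only if it is better.
            best = candidate
    return best


best_x = discover(start=0.0)
```

Removing the `evaluate` comparison leaves only a random walk: the variation is still there, but nothing is retained, which is the gap the speech attributes to purely generative systems.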
Now let us briefly go all the way to modern deep learning, to the backpropagation algorithm. At first it might seem that backpropagation is incapable of discovery because it is deterministic and thus incapable of variation. But this is not correct. The weight updates of backprop are deterministic, but the weights are initialized to small random values. The random initialization is often downplayed, but in fact it is a necessary form of variation; it must be done properly to get good performance. In backprop this Variation is done once, at network initialization, so its effect is temporary, and later the network may lose its ability to learn. This is the weakness of deep learning that is alleviated with a new algorithm that my group presented in Nature a couple of years ago. Our “continual backpropagation” made one small change: every so often a less-used neuron would be re-initialized to small random weights. This allows the variation to continue and plasticity to be retained.
Although there is much more to be said about Creativity and Discovery, this is the key point: they are more than supervised learning, more than pattern recognition, more than prediction, and more than world modeling. Those things are important, but they alone will not bring us to discovery. Discovery requires Evaluation from a person or from an explicit goal, and only in the latter case will we attain full autonomy.
So that is my call to arms. If we want the full power of AI scientists, then we should share the goals with them so they can create, evaluate, discover, and in these ways fully participate in achieving the goals. Let’s be bold! Let’s fully automate Creativity and Discovery!
Breaking: Govini has been named #18 on the 2026 #NatSec100 list issued by @SVDG_official and @JPMorgan, achieving the largest upward advancement of any company on this year's index! 🚀
Our climb to #18 reflects the rapid market adoption of Ark—the first and only AI-enabled product purpose-built to connect the critical functions of major defense program acquisition, from early concepts through production, sustainment, logistics, and modernization.
Read our full announcement here: https://t.co/lKm7s5LnKU and the 2026 NatSec100 report here: https://t.co/25FeUizYo0
#NatSec #DefenseInnovation #Software #Govini
SID-1 is an agentic search model by @SID_AI
→ 1.9x recall over RAG + rerank
→ 24x faster, 99% cheaper than GPT-5.1
trained using large-scale RL on turbopuffer at 1k+ QPS bursts over 10M+ document corpora across thousands of steps
https://t.co/hqmdPUmdLt
Karpathy's prediction about RL is coming true now!
He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, agents need knowledge-guided review as a higher-dimensional feedback channel.
Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek).
And their key bottleneck has always been the reward functions.
GRPO by DeepSeek worked well for math and code because the environment gave a binary signal.
But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes.
RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified.
The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training.
I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow.
In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition.
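A rough sketch of how judge scores can feed a GRPO-style update, assuming each trajectory in a group has already been scored by some judge. The `toy_judge` function below is a hypothetical stand-in for the LLM call and is not the RULER or ART API; only the group-relative normalization (score minus group mean, divided by group standard deviation) reflects GRPO as commonly described.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def group_relative_advantages(scores: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize scores within one group of rollouts for the same prompt."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def score_group(trajectories: List[Dict], judge: Callable[[Dict], float]) -> List[float]:
    """Score every trajectory with the judge, then convert to advantages."""
    return group_relative_advantages([judge(t) for t in trajectories])


def toy_judge(trajectory: Dict) -> float:
    # Hypothetical stand-in for an LLM judge: rewards a larger max tile in 2048.
    return trajectory["max_tile"] / 2048


# Four hypothetical rollouts from the same starting board.
group = [{"max_tile": 128}, {"max_tile": 512}, {"max_tile": 256}, {"max_tile": 128}]
advantages = score_group(group, toy_judge)
```

Because advantages are relative to the group, the judge only needs to rank trajectories consistently; its absolute scale matters much less.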
You can see the full implementation on GitHub and try it yourself.
Here's the ART Repo: https://t.co/fsoLXDK4Zu
(don't forget to star it ⭐ )
Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions.
RL reward engineering is now prompt engineering.
I wrote a full walkthrough covering RL for LLM agents, from RLHF to GRPO to RULER, in the article below.
Super excited to share some of our research on making Genie the best data agent. Data Agents open up a new research frontier for solving complex, real-world enterprise challenges. I was recently discussing with a colleague whether there are still interesting research challenges for Coding agents, and while I believe there are still many challenges there (a topic for another day!), I wanted to highlight some of the unique research challenges for Data Agents and how we tackle them to get up to 3x accuracy improvements on top of coding agents!
https://t.co/SbwG5dYYNt
We’re releasing a 30B-A3B reasoning model that reaches gold-medal level across both physics and math Olympiad evaluations: IPhO directly, and IMO/USAMO with test-time self-verification and refinement.
A simple, unified scaling recipe for proof search.
https://t.co/yc2ZlLVbD2
Qwen3, GLM-5, and MiMo all use on-policy distillation in post-training. Thinking Machines also wrote it up as a cheap alternative to RL.
But in practice it is surprisingly brittle to make work — much more so than SFT or RL.
Three recent papers [1, 2, 3] helped me make sense of why. The mechanism is consistent across the failure modes they describe, and it's worth understanding before running another OPD experiment.
OPD looks like "match the teacher distribution." But in practice, the update is driven by a very small set of next-token choices at each generation step. Mostly just the handful of tokens that both the student and teacher think are plausible as the next token. Once that small set breaks, OPD breaks.
The real object OPD is learning on
At every generation step, the model has a huge vocabulary. 150K tokens, maybe more. But almost all of the probability mass sits on a tiny number of tokens.
One paper [1] shows that the overlapping high-probability tokens between teacher and student carry around 97–99% of the total probability mass. So although OPD is written as reverse-KL over a full vocabulary, most of the useful learning signal comes from a tiny local menu of next-token options.
Here is what I mean by "the handful of tokens." For a given prefix like:
"Let's solve this step by step. First, we…"
the student may think the next token should be one of:
"need", "can", "have", "find", "know", "compute", …
The teacher has its own version of this menu.
OPD works when these two menus mostly overlap, and when the teacher puts higher probability on better choices inside that menu.
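To make the "menu overlap" idea concrete, here is a small sketch that takes each model's top-k next tokens, intersects the two sets, and reports how much probability mass each model places on the shared tokens. The toy distributions and the choice of k are assumptions; the exact overlap definition used in [1] may differ.

```python
from typing import Dict, Set, Tuple


def top_k_tokens(probs: Dict[str, float], k: int) -> Set[str]:
    """Return the k most probable tokens under one model."""
    return set(sorted(probs, key=probs.get, reverse=True)[:k])


def overlap_mass(teacher: Dict[str, float], student: Dict[str, float], k: int) -> Tuple[float, float]:
    """Probability mass that teacher and student each place on their shared top-k tokens."""
    shared = top_k_tokens(teacher, k) & top_k_tokens(student, k)
    return sum(teacher[t] for t in shared), sum(student[t] for t in shared)


# Hypothetical next-token distributions for the prefix in the text.
teacher = {"need": 0.50, "can": 0.30, "have": 0.15, "find": 0.04, "zebra": 0.01}
student = {"need": 0.40, "can": 0.35, "find": 0.20, "have": 0.04, "zebra": 0.01}

teacher_mass, student_mass = overlap_mass(teacher, student, k=3)
```

With these toy numbers the shared menu is {"need", "can"}; when either mass falls well below 1, the two models are effectively choosing from different menus.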
It fails for a few different reasons — the menus don't overlap, the menu drifts somewhere bad mid-training, we only look at one item on the menu instead of the whole thing, or the per-position learning signals end up pulling the model in inconsistent directions.
Four things can go wrong. The first two are about whether the menu is in good shape. The third is about whether we look at the whole menu or just one item from it. The fourth is about whether the signals across positions combine into a useful update.
1. The student and teacher are thinking in different "languages"
A stronger teacher does not necessarily make a better OPD teacher.
Li et al. [1] show this very clearly: a 7B teacher can outperform a 1.5B model on benchmarks, but still fail to improve the 1.5B student through OPD.
Why? Because benchmark accuracy measures final answers. OPD trains on next-token probabilities. A stronger model may solve the same problem through a different reasoning path: different intermediate steps, different phrasing, different proof structure, different local token choices.
So when the student writes its own partial solution, the teacher may not assign useful probability to the student's next natural steps. The teacher is better overall, but not necessarily helpful on the student's current path.
The most interesting experiment is the "reverse distillation", where they take a 1.5B model that was improved by RL — a student that has already moved beyond its original base behavior — and try to distill it back using two teachers: the original pre-RL 1.5B model and a larger 7B model from the same family.
Both teachers pull the RL-improved student backward. The student loses its RL gains and regresses toward the older behavior.
This sounds surprising at first. But the explanation is simple: OPD does not know that the student's RL behavior is better unless the teacher's token probabilities support it. If the teacher still prefers the old reasoning pattern, OPD will train the student back toward that pattern.
So the RL gains disappear not because the teacher is "weak" in benchmark terms, but because the teacher is giving token-level supervision for a behavior the student has already moved past.
Benchmark gap does not tell you whether OPD will work. Token-level compatibility does.
2. Repetition becomes locally rewarding
Even if OPD starts well, it can still collapse.
The most striking failure mode is when training looks fine for a while, then within roughly 30 steps the model starts producing much longer outputs, stops terminating, repetition spikes, and accuracy collapses.
The mechanism is counterintuitive at first but makes sense once you see it.
In sampled-token OPD, the reward for a token is roughly the teacher's log-probability minus the student's log-probability on that token. So if the teacher gives a token much higher probability than the student does, that token receives a large positive signal.
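The per-token signal described above can be sketched in a few lines, assuming we already have the probability that teacher and student each assigned to every token the student sampled. The function name and the toy probabilities are illustrative.

```python
import math
from typing import List


def sampled_token_rewards(teacher_probs: List[float], student_probs: List[float]) -> List[float]:
    """Reward per sampled token: log p_teacher(token) - log p_student(token)."""
    return [math.log(pt) - math.log(ps) for pt, ps in zip(teacher_probs, student_probs)]


# Hypothetical probabilities for two tokens the student sampled.
# Token 0: the teacher is more confident than the student, so the reward is positive.
# Token 1: the student is more confident than the teacher, so the reward is negative.
rewards = sampled_token_rewards(teacher_probs=[0.9, 0.2], student_probs=[0.3, 0.6])
```

Note that the reward depends only on the ratio of the two probabilities, not on whether the prefix leading up to the token was any good.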
Now imagine the student starts repeating itself. In practice this looks less like coherent sentences repeating and more like degenerate loops — something like "wait, wait, wait, wait, wait" filling the rest of the context.
This prefix is bad globally. But locally, it is very predictable.
A strong teacher is often very confident about predictable text. Once the loop has gone on for a while, the teacher can assign high probability to the next repeated token. The student may be less confident than the teacher. So the repeated token gets a large positive log-ratio.
That means OPD accidentally rewards continuing the repetition.
Before repetition starts, repeated tokens are rare, so they don't matter much. But once repetition appears, those tokens become frequent. And because they also receive large positive advantages, they start dominating the update. Luo et al. [2] measure advantages on repeated tokens that are 4 to 9 times larger than on normal tokens after collapse.
Then the loop reinforces itself:
more repetition → more predictable prefix → higher teacher confidence → larger positive signal on repeated tokens → even more repetition
This is different from the usual length-bias issue in RL. It's more specific — a broken prefix creates locally high-reward repeated tokens, and OPD faithfully amplifies them.
3. We often only look at one item on the menu
The clean objective would compare the teacher and student distributions over multiple possible next tokens. But many public OPD recipes — including the ones used industrially — use a cheaper version:
Let the student generate one token. Then ask: did the teacher assign this exact token higher or lower probability than the student? If teacher probability is higher, push the student toward it. If lower, push the student away.
That is the sampled-token log-ratio. It is cheap because you only score the token the student actually sampled. You do not need to compare the full vocabulary.
There is a real reason for this design choice. Full sequence-level reverse-KL is noisy for long generations because an early token update gets entangled with many future rewards.
Token-level OPD avoids that by giving each token its own local feedback. That gives much better variance scaling with length [3] — worst-case variance grows as O(T²) for token-level instead of O(T⁴) for sequence-level. So for long reasoning traces, token-level feedback is attractive.
The problem is that "one sampled token" is a very noisy view of the teacher's actual next-token preference. At a given step, the teacher may have a whole cluster of reasonable next tokens. But sampled-token OPD only checks the one token the student happened to pick.
This creates three problems.
First, the student samples tokens from its own distribution, so on most positions the student's probability exceeds the teacher's and the log-ratio is negative. The reward is computed as teacher minus student, and the student is picking tokens where its own log-prob is near its highest — meaning the subtraction is almost always against the student's strongest values. Positive signal only shows up when the student happens to sample a token the teacher likes even more than the student does, which is the minority case.
Second, if the student drifts into weird prefixes, the teacher's local probabilities may no longer reflect global quality.
Third, tokenization and special-token differences can create fake disagreements. The student and teacher may represent the same text with different token boundaries, so a single-token comparison can look terrible even when the underlying string is fine.
The fix proposed in [3] is simple: don't guess the teacher's local preference from one sampled token. Instead, take the teacher's top-k next tokens, renormalize both teacher and student probabilities over that set, and compute reverse-KL there.
It's still cheap — you only need the top-k logits, not the whole vocabulary. But it changes the supervision from "did the teacher like this one sampled token?" to "among the teacher's plausible next tokens, does the student put probability mass in the same places?"
That is a much better local learning signal.
They also add top-p sampling during rollout, so the student is less likely to wander into extremely low-probability prefixes, and mask special tokens to avoid fake tokenization mismatches.
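A minimal sketch of the top-k variant at a single position, assuming reverse-KL means KL(student || teacher) and that both distributions are renormalized over the teacher's top-k tokens. The probability floor for tokens the student has not seen and the toy distributions are assumptions; the implementation in [3] may handle these details differently.

```python
import math
from typing import Dict


def topk_reverse_kl(teacher: Dict[str, float], student: Dict[str, float],
                    k: int, floor: float = 1e-12) -> float:
    """KL(student || teacher), with both renormalized over the teacher's top-k tokens."""
    top = sorted(teacher, key=teacher.get, reverse=True)[:k]
    t_vals = [teacher[t] for t in top]
    # Floor the student's probabilities so the logarithm stays finite (an assumption).
    s_vals = [max(student.get(t, 0.0), floor) for t in top]
    t_mass, s_mass = sum(t_vals), sum(s_vals)
    kl = 0.0
    for t_prob, s_prob in zip(t_vals, s_vals):
        q = t_prob / t_mass  # teacher, renormalized over the top-k set
        p = s_prob / s_mass  # student, renormalized over the same set
        kl += p * math.log(p / q)
    return kl


# Hypothetical next-token distributions.
teacher = {"need": 0.50, "can": 0.30, "have": 0.15, "find": 0.05}
student_off = {"need": 0.10, "can": 0.10, "have": 0.10, "find": 0.70}

kl_same = topk_reverse_kl(teacher, teacher, k=3)
kl_off = topk_reverse_kl(teacher, student_off, k=3)
```

Unlike the sampled-token log-ratio, this quantity is zero only when the student spreads its mass across the teacher's plausible tokens in the same proportions, regardless of which single token happened to be sampled.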
4. Even good token signals may not add up
This is the least developed of the four failure modes, and possibly the most important.
Li et al. [1] compare a successful OPD setup against a failing one and find something strange. The failing teacher's per-token advantages are actually larger than the working teacher's, but the gradient norms are smaller. And the failing teacher's sequence-level reward can still distinguish correct from incorrect rollouts, comparable to the working case.
So the reward signal is globally informative. It just does not produce useful gradients.
What's going on is that OPD computes a learning signal at every token position in the rollout, and then sums them into one gradient update.
Each position's contribution is a vector in parameter space pointing in some direction. So if the per-position vectors mostly point in the same direction, they add up to a big coherent push. But if they point in different directions, they partially cancel when summed, and the model barely moves even though each individual position had something to say.
The empirical fingerprint is consistent with the second case: large per-token advantages but small gradient norms after summing — individually strong signals that cancel out when combined. The successful teacher shows the opposite pattern: smaller per-token advantages, larger gradient norms after summing — weaker individual signals that reinforce each other when combined.
The paper that raises this leaves it as a hypothesis, but the empirical fingerprint (large per-token advantages, small gradient norms, informative sequence-level reward) is specific enough that it should be testable.
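The cancellation argument is easy to see with toy vectors. The sketch below uses made-up 2-D "per-position gradients" rather than real model gradients, purely to show how large individual contributions can sum to a small update while small aligned ones add up.

```python
import math
from typing import List, Sequence


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def summed_norm(vectors: List[Sequence[float]]) -> float:
    """Norm of the sum of per-position contributions, i.e. the size of the final update."""
    total = [sum(component) for component in zip(*vectors)]
    return norm(total)


# Hypothetical per-position contributions (not real gradients).
aligned = [(0.5, 0.0)] * 4  # weak signals pointing the same way
conflicting = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]  # strong signals that cancel

aligned_update = summed_norm(aligned)
conflicting_update = summed_norm(conflicting)
mean_aligned_strength = sum(norm(v) for v in aligned) / len(aligned)
mean_conflicting_strength = sum(norm(v) for v in conflicting) / len(conflicting)
```

One way to test the hypothesis on a real run would be to compare the norm of the summed gradient against the sum of per-position gradient norms; a ratio near zero would indicate cancellation.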
What this explains
A lot of practical OPD fixes start to look related.
SFT cold start helps because it moves the student closer to the teacher's reasoning style before OPD begins.
Teacher-aligned prompts help because they put the student in regions where the teacher gives more reliable feedback.
KL regularization helps because it prevents the student from drifting too quickly into weird generations.
Mixture distillation helps because it keeps some clean reference trajectories in training, so the rollout distribution does not become fully self-generated garbage.
Top-k matching helps because it stops pretending one sampled token is enough to represent the teacher's local preference.
These look like different tricks. But they are all trying to protect the same thing: the small set of plausible next tokens where teacher and student can actually communicate.
The ceiling, and the real open question
The more interesting part is long-horizon reasoning.
The deeper the student gets into its own generated solution, the more likely the prefix is something the teacher would not have written. And once the teacher is judging prefixes outside its own natural distribution, its token probabilities become less reliable.
One paper [1] shows this directly: teacher continuation advantage drops sharply as the student prefix gets longer, from +0.37 at 1K prefix to +0.02 at 16K.
That is a bad sign for long-CoT and agentic OPD, because those are exactly the settings where the student spends many steps inside its own partially-generated world.
OPD works best when the teacher and student stay close enough that teacher probabilities remain meaningful. Long-horizon agentic training pushes in the opposite direction.
This is the open question I would most want to see investigated. The failure modes in sections 1-3 are diagnosable and have proposed fixes. Section 4 is a hypothesis about local gradient structure that should be testable. But the long-horizon ceiling is different — it is about whether OPD's core assumption (the teacher knows something useful about the student's next token) can hold at all when the student is operating many steps inside its own generated world.
My current takeaway
OPD isn't really a full-vocabulary distribution-matching problem. It's a fragile communication protocol between teacher and student through a tiny local menu of next-token choices. When that menu overlaps, stays clean, and produces gradients that add up, OPD works. When it doesn't, OPD quietly trains the student in the wrong direction.
References
[1] Li et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. https://t.co/12maoXIwKT
[2] Luo et al. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. https://t.co/VenEbNb6k7
[3] Fu et al. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. https://t.co/SJCYO8S8Sn
This 115-page book unlocks the secrets of LLM fine-tuning.
https://t.co/Uhs8edPUV8
A comprehensive guide that covers:
> the fine-tuning process for LLMs
> both theory and practice.