Reinforcing Recursive Language Models
Can a 4B model learn to recursively call itself to answer hard long-context questions?
We RL fine-tuned a small model to behave as a native RLM.
On evidence selection across scientific papers, our 4B RLM matches Sonnet 4.6 in quality while running significantly faster and cheaper.
double descent is one of the most counterintuitive results in deep learning, and you can't really show it on plain MNIST. MNIST-1D can. huge thanks to @samgreydanus & @hippopedoid for the paper on @askalphaxiv. new video reproduces it from scratch 👇
TMax: An open RL recipe for terminal agents
I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements.
TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2.
As a general summary, the paper offers open data and recipe lessons from hillclimbing the smaller, dense Qwen 3.5 models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are designed more to make training stable than to improve the rate of learning.
I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested.
The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc.
Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks.
What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMax can be for terminal agents, or at least the start of it. Yes, the training jobs are expensive: the paper documents a standard training job as 8 nodes of H100s (2 train, 6 inference) for 2-3 days. But that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right.
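For a rough sense of scale, here is a back-of-the-envelope sketch of the GPU-hours implied by those numbers. The 8 GPUs per node figure is an assumption (the text only says "8 nodes of H100s"), and 100 jobs is a stand-in for O(100), so treat the output as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope GPU-hour estimate for the training jobs described above.
# ASSUMPTION: 8 GPUs per node (not stated in the text).
GPUS_PER_NODE = 8
NODES = 8            # from the text: 2 train + 6 inference
DAYS_RANGE = (2, 3)  # from the text: 2-3 days per job
NUM_JOBS = 100       # stand-in for "O(100)" jobs, not an exact count


def gpu_hours(nodes: int, gpus_per_node: int, days: int) -> int:
    """GPU-hours consumed by one training job."""
    return nodes * gpus_per_node * days * 24


per_job = tuple(gpu_hours(NODES, GPUS_PER_NODE, d) for d in DAYS_RANGE)
total = tuple(h * NUM_JOBS for h in per_job)

print(f"per job: {per_job[0]}-{per_job[1]} GPU-hours")
print(f"~{NUM_JOBS} jobs: {total[0]}-{total[1]} GPU-hours")
```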
This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero” model families, which are clean RL runs from a base model on a certain domain. This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on.
Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email).
As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way.
So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, one that is a bit more stable because it improves token-level clipping.
Autoresearch is quickly becoming one of the most exciting frontiers in AI. We've moved past simply answering questions into carrying out real experiments end-to-end.
Huge thanks to the community for pushing these boundaries with us.
"Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of LLMs"
Long-context RLVR is powerful, but most of the cost comes from generating huge CoT rollouts.
Sparse attention would make this faster, but naive sparse rollouts drift from the dense policy and can collapse training.
This paper finds that stability depends on the worst-aligned tail of tokens, not the average mismatch.
So by keeping the 5th percentile sparse-to-dense acceptance rate around 0.86 with dynamic sparsity, Sparrow gets about 2x rollout speedups while matching dense RL performance.
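A minimal sketch of the tail-vs-average idea, not the paper's implementation. It assumes the per-token acceptance rate is min(1, p_dense / p_sparse) on the sampled token, and the `adjust_keep_ratio` controller (with its step size and bounds) is purely hypothetical. Only the 0.86 target and the 5th percentile come from the text.

```python
import math

TARGET_P5 = 0.86  # from the text: keep the 5th-percentile acceptance near this


def acceptance_rates(logp_dense, logp_sparse):
    """ASSUMED definition: per-token acceptance min(1, p_dense / p_sparse)."""
    return [min(1.0, math.exp(d - s)) for d, s in zip(logp_dense, logp_sparse)]


def percentile(values, q):
    """Nearest-rank percentile, q an integer in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def adjust_keep_ratio(keep_ratio, p5, step=0.05, lo=0.05, hi=1.0):
    """HYPOTHETICAL controller: attend to more keys when the tail is badly aligned."""
    if p5 < TARGET_P5:
        keep_ratio += step   # less sparse, closer to the dense policy
    else:
        keep_ratio -= step   # more sparse, faster rollouts
    return min(hi, max(lo, keep_ratio))


# Toy data: 18 tokens agree exactly, 2 tokens are half as likely under dense.
logp_dense = [math.log(0.4)] * 18 + [math.log(0.2)] * 2
logp_sparse = [math.log(0.4)] * 20

acc = acceptance_rates(logp_dense, logp_sparse)
mean_acc = sum(acc) / len(acc)
p5 = percentile(acc, 5)
new_keep = adjust_keep_ratio(0.5, p5)

print(f"mean acceptance: {mean_acc:.2f}, 5th percentile: {p5:.2f}")
print(f"keep ratio: 0.50 -> {new_keep:.2f}")
```

In this toy case the mean looks healthy while the 5th percentile is far below target, which is the kind of mismatch an average-based check would miss.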
"VIMPO: Value-Implicit Policy Optimization for LLMs"
While GRPO is simple because it avoids a critic, it still gives every token in a reasoning trace the same reward signal.
This paper tries to get the best of both worlds by deriving the value function from the policy itself using policy-reference log-ratios, so the model gets token-level credit assignment without training a separate critic.
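To make the contrast concrete, here is a small sketch. The first function is the usual GRPO-style group-normalized advantage, copied to every token of a response. The second only illustrates how a per-token signal can be read off policy-reference log-ratios; its exact form and the `beta` scale are assumptions for illustration, not VIMPO's actual derivation.

```python
import statistics


def grpo_token_advantages(rewards, lengths):
    """GRPO-style: one group-normalized advantage per response, same for every token."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards tie
    return [[(r - mean) / std] * n for r, n in zip(rewards, lengths)]


def log_ratio_token_signals(logp_policy, logp_ref, beta=0.1):
    """ASSUMED form: per-token signal beta * (log pi - log pi_ref)."""
    return [beta * (p - q) for p, q in zip(logp_policy, logp_ref)]


# Two sampled responses in one group: one correct (reward 1), one wrong (reward 0).
uniform = grpo_token_advantages(rewards=[1.0, 0.0], lengths=[3, 2])

# Made-up per-token log-probs for the first response under policy and reference.
per_token = log_ratio_token_signals([-1.0, -2.0, -0.5], [-1.5, -2.0, -1.5])

print(uniform)    # every token in a response shares one value
print(per_token)  # values differ token by token
```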
On math RLVR, VIMPO beats GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, and stays stronger under noisy rewards.
4/4: You can play with this yourself! Visit https://t.co/8PXLy9u7e5 (https://t.co/orpVTpjX7e) or change ‘arxiv’ to ‘autoarxiv’ on the official SkyRL paper https://t.co/12h0WcIdhF and use the GLM 5.2 model to iterate on the repo!
Introducing GLM 5.2 for autoresearch
GLM 5.2 is the first open weights model we've tried on our autoresearch pipeline that's proven capable for real research tasks.
With Fable 5's restrictions on research, having an open weights alternative is a huge win for open source
Watch it carry out fully async vs colocated sync RL training on Harbor code contests across two 8xH100 nodes on top of SkyRL. Resolves setup issues, tracks runs to completion, and produces a full comparison of throughput and reward stability
3/4: One limitation worth noting: GLM 5.2 has no image understanding. While Opus and Fable can consistently identify trends in WandB charts, GLM resorts to writing numpy code to smooth and clean the raw WandB numbers before analyzing. For simpler runs like this example, that is sufficient, but for experiments with larger sweeps or ablations it would likely cause GLM real problems.
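For a sense of what that kind of cleanup looks like, here is a rough sketch using an exponential moving average. This is not GLM's actual code: it uses plain Python rather than numpy to stay dependency-free, and the function names, the `alpha` value, and the reward numbers are all made up for illustration.

```python
def ema_smooth(values, alpha=0.3):
    """Exponential moving average; skips None entries (e.g. missing log steps)."""
    smoothed = []
    prev = None
    for v in values:
        if v is None:
            continue
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def trend(values, alpha=0.3):
    """Crude trend estimate: last smoothed value minus the first."""
    s = ema_smooth(values, alpha)
    return s[-1] - s[0] if s else 0.0


# Made-up noisy reward curve with one missing step.
raw_rewards = [0.10, 0.30, None, 0.15, 0.40, 0.35, 0.55, 0.45, 0.60]
smoothed = ema_smooth(raw_rewards)
direction = "up" if trend(raw_rewards) > 0 else "flat or down"
print(f"{len(smoothed)} points after cleaning, trend: {direction}")
```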
"How Transparent is DiffusionGemma?"
Because Diffusion LMs denoise tokens instead of generating from left to right, there's a concern that reasoning will be hidden in the denoising steps.
But this paper shows DiffusionGemma is more readable than expected.
At first, it looks much more opaque than Gemma 4 because the model runs many denoising steps before reaching something we can clearly read. In the paper’s opaque serial depth metric, DiffusionGemma is 608,016 vs 21,235 for Gemma 4, which is where the 28.6x more opaque figure comes from.
But once the denoising bottleneck is treated as readable by mapping hidden states into likely text tokens, DiffusionGemma drops to 23,571, only 1.1x Gemma 4.
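The two multipliers are just ratios of the quoted opaque serial depth numbers. A quick check (variable names are mine):

```python
# Opaque serial depth values as quoted above.
diffusion_raw = 608_016       # DiffusionGemma, denoising treated as opaque
diffusion_readable = 23_571   # DiffusionGemma, denoising bottleneck treated as readable
gemma4 = 21_235               # Gemma 4 baseline

raw_ratio = diffusion_raw / gemma4
readable_ratio = diffusion_readable / gemma4

print(f"{raw_ratio:.1f}x")       # prints "28.6x"
print(f"{readable_ratio:.1f}x")  # prints "1.1x"
```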
What's also really cool is that it contains some diffusion-only behaviors. It can change earlier answers after later reasoning, guess response length early, write code skeleton-first, and keep multiple token possibilities alive before settling.
Turn any paper into running code.
Just swap arxiv → autoarxiv in the paper url.
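If you want to script the swap, a minimal sketch might look like this. The helper name is made up, the example paper ID is a placeholder, and the only thing taken from the text is the arxiv → autoarxiv substitution itself; it touches only the host part of the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def to_autoarxiv(url: str) -> str:
    """Swap 'arxiv' for 'autoarxiv' in the host only; path and query are untouched."""
    parts = urlsplit(url)
    host = parts.netloc
    if "autoarxiv" in host or "arxiv" not in host:
        return url  # already swapped, or not an arxiv link
    new_host = host.replace("arxiv", "autoarxiv", 1)
    return urlunsplit(parts._replace(netloc=new_host))


# Placeholder paper ID, not a real paper.
print(to_autoarxiv("https://arxiv.org/abs/0000.00000"))
```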
That hands the paper to an AI agent from alphaXiv. It reads the abstract, the claims, and the linked GitHub repo, then clones the codebase and works through the usual setup pain like dependencies, broken paths, environment config, and hardware assumptions.
From there it designs a minimal reproduction. That means a smaller model, fewer steps, and a single GPU instead of a cluster, scaled down just enough to test whether the headline claim holds.
The whole run is live and fully logged. Loss curves, metrics, and training progress are all observable as it happens.
What comes back is a clean signal on whether the minimal run matches the paper's reported result, plus an estimate of what a full replication would cost in compute and time.
A lot of research code dies in setup before anyone verifies a single number. This moves reproduction from a weekend of debugging to a url change.
Pick a paper and try it now.
video credits: @askalphaxiv
“Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers”
While Looped Transformers can spend more depth on harder problems, they still need a good way to know when to stop.
This paper makes the hidden state itself the stopping signal: the model keeps looping until it converges to a fixed point.
With pre-norm, residual scaling, and damping, FPRM becomes stable at large depths, adapts compute to task difficulty, and beats similar 7M reasoning models on Sudoku, Maze, ARC-AGI-1, and state tracking.
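The stopping rule can be sketched as a damped fixed-point iteration. This is only a toy under stated assumptions: `toy_block` stands in for the shared transformer block, the damping value and tolerance are arbitrary, and pre-norm and residual scaling are left out entirely. It is not the FPRM architecture.

```python
import math


def fixed_point_loop(step, h0, damping=0.5, tol=1e-6, max_loops=1000):
    """Iterate h <- (1 - damping) * h + damping * step(h) until the update is tiny.

    Returns the final state and the number of loops used.
    """
    h = list(h0)
    for n in range(1, max_loops + 1):
        target = step(h)
        new_h = [(1 - damping) * a + damping * b for a, b in zip(h, target)]
        delta = math.sqrt(sum((x - y) ** 2 for x, y in zip(new_h, h)))
        h = new_h
        if delta < tol:  # the hidden state stopped moving: treat as converged
            return h, n
    return h, max_loops


def toy_block(h):
    """Stand-in for one pass through a looped block (elementwise cosine)."""
    return [math.cos(x) for x in h]


state, loops = fixed_point_loop(toy_block, [0.0, 1.0, 2.0])
print(f"converged after {loops} loops")
```

The loop count is decided by the state itself rather than fixed in advance, which is the sense in which compute adapts to the input.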
“Do as I Do: Dexterous Manipulation Data from Everyday Human Videos”
Robot dexterity is bottlenecked by data: teleoperation and MoCap are expensive, and internet videos are only observational. This paper turns normal RGB human videos into executable robot hand trajectories.
It reconstructs the hand and object, tracks the object with SAM 3D guided diffusion, then retargets the motion through physics-aware optimization so a real dexterous hand can do the same task.
The result is a path from internet video to real robot rollouts, with retargeting success improving from 25% to 71% on noisy reconstructions.