TMax: An open RL recipe for terminal agents
I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements.
TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2.
As a general summary, the recipe is open data and recipe lessons from hillclimbing the Qwen 3.5 smaller, dense models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are more designed to make training stable than to improve the rate of learning.
I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested.
The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc.
Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks.
What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMAX can be for terminal agents, or the start of. Yes the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 train 6 inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right.
This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero“ model families, which are clean RL runs from a base model on a certain domain. This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on.
Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email).
As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way.
So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, by virtue of being a bit more stable by improving token-level clipping.
"A self-amplifying cycle of math-like cognitive exploration, discovery, pleasure, and addictive investment of his mental energy." Hard to get these with phones.
"Tao’s early and intensive study of mathematics" isn't a "reason" but a symptom that, from an early age, he had entered a self-amplifying cycle of math-like cognitive exploration, discovery, pleasure, and addictive investment of his mental energy.
Asking why this happened to him and not to someone else is a bit vacuous, like asking why a hurricane formed on this very day at this very spot, and not the next day at another spot. There are obvious cofactors, like the favorable parenting style of his family, but they only explain a fraction of the variance.
This, by the way, is a normal pattern, a manifestation of Turkheimer's third law.
i feel like it is overall by default too distracted by social dynamics at the expense of object level analysis and idk quite how to prompt it to stop flipping between agree w mimi / be contrary to mimi as a prior that makes whatever it says somewhat less intelligent for having chosen a side and then focused only on points that most clearly support that side and neglecting important aspects that are orthogonal
It’s actually kinda cool that we are in an age of monotony and extreme conformity that anyone without design knowledge can make a great visual identity simply due to the fact that it’s non-conforming. All DIY branding goes hard, doesn’t matter how bad, because it is unique: the primary objective of a visual identity. Having no design knowledge is a pre-requisite and you need to be sterile/isolated from mainstream global design culture.
It might be just total shit in another era but not in this, because the bar is so low. Engineers and accountants alike, rejoice! Draw anything in MS paint, it will be great! 👍
pre-registering: despite its terminalbench scores, i will find 5.6 Sol max extremely useful for implementation, review, and debugging, but less useful than fable for architectural and design tasks, and continue my pattern of primarily driving a Claude with codex subagents
I'm afraid this is mostly an effect of "the curse of depth"
@behrouz_ali don't you think so? If later layers predominantly serve as filters/feature suppression rather than continuation of complex circuits, makes sense that they don't need a lot of capacity.
🚨 Exclusive GPT-5.6 scoop:
- Today 5.6 launched for OpenAI enterprise partners for testing ahead of the wider launch
- ETA for wider launch is the 2nd week of July
- There will be NO pricing changes
- A new "max" reasoning effort will be introduced for the 5.6 series
- The model is less token efficient than 5.5
I find it quite disturbing that prefix caching considerations make it really hard to justify any context optimization/organization innovation, as whatever you may want to do will wreck the cache
The question is, are companies with high scores on ARC-AGI training on ARC-AGI-like synthetic problems? I think xAI does…
Honestly I think that "agentic workflows" is a misguided perspective and an irrationally bearish outlook on generalization. Teach models to reason.
in the (human-observable) limit, these language models will all be speaking chinglish:
- most highly-educated digital workers speak english or chinese
- peak token efficiency <> expressiveness
- compact translations of those four-word chinese idioms (成语) or continued language compaction with gen-z slang and beyond
- common grammatical structure amplified between the two languages when expressing concepts in corporate environments (this++ if/when there's also overlap with hindi)
or maybe more simply: we're purely in a digital data numbers game (against a perfect ai translation amplifier) from here on out
@jlffinance Very similar to Slack: deep infra adaptation. This is a nice sweet spot: very conservative profession but way too ouch complexity for skills to be manageable.
Frontier companies may be unable to capture much of the benefits of AI development due to distillation. If this market failure ends up defusing the dangerous AI race, it would be an illustration of the Theory of the Second Best (which is one of my favorite ideas in economics).
@deepfates hey billionaires which of you is motivated to max the proportion of biodiversity that will ever be catalogued in our lightcone
it's cheaper than you'd think to scale-up species discovery. there's some urgency. esp if you assume extinctions outpace speciations (pretty safe bet)
@deepfates Yeaa turns out pollination ecology doesn't pay Does this work can I just convert a c-corp and start raising philanthropic money to fund biodiversity monitoring