💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.
AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.
Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.
Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).
One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.
This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!
🚨 FutureSim Update 🚨
We evaluated Opus 4.7 at max reasoning in Claude Code.
Despite potential test-set contamination with knowledge cutoff of Jan '26, it scored just 21%, barely edging past Opus 4.6 and still behind GPT 5.5!
Will Mythos be a step-change on FutureSim as it is for coding benchmarks?
I hope more people read Section 5 of our paper.
It's easy to generate a ranking among models with a benchmark.
We do that properly (sec 4), but really the main point is all the research (sec 5) that can be done on top of this very new (temporal + open-ended) way to do evals
Introducing FutureSim, the first interactive environment testing agents on predicting world events.
We build a simulation where agents face forecasting questions over the course of 3 months. News articles come in each day and agents continuously revise their prediction in light of new information as we show below for GPT-5.5. (1/5)
What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt?
One domain we really like there is forecasting, as a hard task that test reasoning under uncertainty.
We've made a benmchmark out of this, where we simulate a whole 3 month period of news, and sanboxed let models continuously read news from those days, plan, and update their forecasts. (see the animation below, just don't be fooled by its speed, this is a slice of the larger 12m token trajectory)
Many more details linked below:
💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.
We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.
The best performing agent, GPT 5.5 in Codex, consumes 3700 turns and 12.4M tokens spanning many sequential context window compactions in a single run.
(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)
Can agents continually adapt their predictions given new information from real-world events across several months?
A very long horizon benchmark:
https://t.co/cGSy5u3wDi
Details👇
Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎
FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔
We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈 Open weight frontier models have a significant gap to catch up, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%
On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market 💰💸
FutureSim serves as a test bed for evaluating a lot of important agentic capabilities
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute
More cool insights and deep dives in our paper 👇
Can agents continually adapt their beliefs with new information from real-world events?
We provide a testbed for LLM agents to learn to accumulate useful signals across time.
Exciting new directions👇:
• Memory
• Search
• Multi-agent self-play
• Inference Scaling
Continual learning is bottlenecked by realistic evaluations
Introducing FutureSim, which replays real-world events in the temporal order they occurred
We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code
We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵Be created by instruction-tuning for the stream format
🔵Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.
the tabooification of research ideas in ai safety in this manner is silly. if it helps performance just assume a frontier lab is already doing it, and if a frontier lab is already doing it then it’s good to write papers on it so we can get more eyes on it to fix problems.
🤔 I went to ICLR with a question I had for months: if I were designing a continual learning system today, would I put new knowledge in the weights or in the context? Almost everyone I asked answered "context."
That's a dismissive answer! I have spent years working on in-weight methods, and I do not think gradient-based consolidation is dead, just badly matched to what practitioners in industry actually want from continual learning, which is high-fidelity recall of past interactions.
Fortunately, a position paper from a 24-author Dagstuhl group landed in my feed and argued, more carefully than I had been managing on my own, that the right answer is neither.
In-context learning is for fast adaptation and lossless recall. In-weight learning is for slow consolidation of skill. The real research problem is the modular memory between them, deciding what gets promoted from context into the weights.
Hopefully the community will now ask less about "ICL or IWL" and more about "what is the right promotion policy, and on what evidence."
📄 Modular Memory is the Key to Continual Learning Agents
#ContinualLearning #ICLR2026 #MachineLearning #FoundationModels
My first paper is now on arXiv: Instrumental Choices.
We ask a simple question: when an LLM agent can finish a real task by following the rules or by taking a useful policy-violating shortcut, which path does it choose?
Europe does not lack innovation.
It lacks scale.
European universities produce world-class research, engineers and technology. But too many companies remain trapped inside fragmented national markets instead of scaling immediately across the continent.
The numbers are clear:
→ EU private R&D investment growth has slowed sharply
→ Europe’s share of global corporate R&D investment has fallen from 21.4% in 2014 to 16.2% in 2024
→ Europe still has too few large tech champions because companies face fragmented regulation, smaller capital pools and slower growth financing
→ Startups must expand country by country instead of scaling through one fully integrated market
Europe’s innovation problem is not creativity. It is market size, capital depth and speed of scaling.
A continent with world-class talent cannot keep turning great research into small companies.
Europe needs one real market for innovation.
wrote up some random experiments I did playing around w/ absolute zero at the start of the year: https://t.co/DmVKSjY9YK
a little negative which I attribute mainly to skill issues on my part but potentially interesting to some :)
The links to the mentioned leaderboards:
https://t.co/2DkbRKdnKx
https://t.co/LOA4INJL01
https://t.co/xqtZFz2kf6
PostTrainBench is probably the best out of those three.
CORE-Bench is already saturated and MLE-Bench is also already likely at ~75-85% with Mythos and GPT-5.5
Other ML/AI related benchmarks worth tracking:
https://t.co/38CA9RAOvw
https://t.co/Q10TCgVzbj
https://t.co/73HRMHYbV5
For time-horizons / super long-context:
https://t.co/SSMxTPaFjw
https://t.co/Koarr7L9fO
https://t.co/3j3IwcXHn9
Huge news: @prior_labs has signed a definitive agreement to be acquired by @SAP.
€1B+ invested over four years to build a globally-leading frontier AI lab for structured data — in Europe, in the open.
Independent entity. Same team, same mission, same open models. A massive boost to what we can do. The mission just got accelerated.
Founders’ statement: https://t.co/7ZEi7q8a8l
(Deal subject to regulatory approval; terms not disclosed.)
I am begging you to look at your data.
Please look at the data
evals worse than expected? look at the data
evals better than expected? *definitely* look at the data
evals about what you expected? believe it or not ....
GPT 5.5 results are out on PostTrainBench!
With reprompting: 28.35% (#2, just behind Opus 4.7 at 28.56%) Without reprompting: 25.02% (#4)
The top 3 are now separated by less than 0.4 points - Opus 4.7, GPT 5.5, and GPT 5.4
Reprompting continues to matter: a 13% relative gain for GPT 5.5, similar to what we saw with GPT 5.4. Near-perfect BFCL score too (99.25%).
https://t.co/bUywrYfisI