Working with agents for the past few months has convinced me that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand whether the agent really did its job!
In our paper "Log analysis is necessary for credible evaluation of AI agents", we
➡️ introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns);
➡️ outline four key principles for conducting log analysis effectively;
➡️ present a case study of how log analysis helped us find a variety of benchmarking errors on τ-bench; and
➡️ give a set of recommendations to improve log analysis quality and adoption.
📄https://t.co/2xKsB4oMaU
More details in @PKirgis's thread below ⬇️
New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵
Paper: https://t.co/7GHgPenoeH
Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵
I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run it is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: https://t.co/jj9F3wWMo5
As benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results. Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. https://t.co/rTcirSHuRO
This builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work.
I’m sure we’ll find many other issues in our evals, but I genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.
Language models read their own outputs as evidence for their current persona, sometimes entrenching it.
Cozmin Ududec (@CUdudec) leads the Science of Evaluation team at UK AISI and is taking on Pivotal fellows to study how personas carry over, stabilise, drift, or compound across long conversations.
A hill that I will die on: with today's AI models, intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per $.
This is especially true when using it in a product like Codex.
We (@AISecurityInst) tested GPT-5.5 for its cyber capabilities and safeguards. It's the strongest-performing model we've tested on our narrow cyber tasks, and it solved one of our cyber ranges in 1 of 10 attempts. We found a universal jailbreak with 6 hours of expert red teaming.
This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings!
Glad the AISI SoE team could contribute to this effort.
Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.
More broadly: are there better ways to run these expensive, low-sample evaluations to get more insight efficiently?
One idea is to run an episode end-to-end once, then return to an intermediate progress state, branch, and sample more heavily from that point.
Could designs like this help us estimate time-horizons, inference-scaling efficiency, robustness, and harness effects?
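The branch-and-resample idea above can be sketched in a few lines. This is a toy illustration under loud assumptions, not an actual evaluation harness: the "episode" is a made-up process where the agent advances one step at a time with a fixed probability, a "checkpoint" is just a step index, and `rollout` and `branch_and_resample` are hypothetical names.

```python
import random


def rollout(start_step: int, total_steps: int, p_advance: float,
            rng: random.Random) -> int:
    """Toy episode (assumption): from start_step, advance one step at a time
    with probability p_advance and stop at the first failure.
    Returns the last step reached."""
    step = start_step
    while step < total_steps and rng.random() < p_advance:
        step += 1
    return step


def branch_and_resample(checkpoint: int, total_steps: int, p_advance: float,
                        n_branches: int, seed: int = 0) -> float:
    """Estimate P(solve | reached checkpoint) by sampling n_branches
    continuations from a saved intermediate state."""
    rng = random.Random(seed)
    solved = sum(
        rollout(checkpoint, total_steps, p_advance, rng) == total_steps
        for _ in range(n_branches)
    )
    return solved / n_branches


if __name__ == "__main__":
    # Hypothetical numbers: 30-step task, branch from step 20.
    est = branch_and_resample(checkpoint=20, total_steps=30,
                              p_advance=0.9, n_branches=2000)
    print(f"estimated P(solve | step 20) = {est:.3f}")
```

The appeal of a design like this is that the expensive prefix is paid for once, while the samples are spent on the part of the episode where the outcome is still uncertain. In a real harness, the hard part would be saving and restoring the environment and agent state faithfully.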
This growing variance in the step reached at a given budget (or in the tokens needed to reach a given step) could be a big issue for estimating performance on very long-horizon tasks at very large token budgets.
One thing I find interesting about this result is the large gap between the best run (dashed red line) and the average over 10 runs (solid heavy red line) for Mythos.
At around 80M tokens, the best run is finished, while the average is still at step 20.
Put another way, there is a huge variance in the random variable `log(token) to solve step n`!
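A minimal sketch of how that spread could be summarised, assuming you have logged, for each run, the number of tokens used when step n was first solved. The helper name `log_token_spread` and the token counts in the example are made up for illustration; they are not data from the result discussed here.

```python
import math
from statistics import mean, stdev


def log_token_spread(tokens_to_step: list[float]) -> dict[str, float]:
    """Summarise per-run tokens-to-reach-step-n on a log10 scale.
    Needs at least two runs, since stdev is undefined for one sample."""
    logs = [math.log10(t) for t in tokens_to_step]
    return {
        "best_tokens": min(tokens_to_step),      # the luckiest run
        "geo_mean_tokens": 10 ** mean(logs),     # mean taken in log space
        "log10_std": stdev(logs),                # sample std of log10(tokens)
    }


if __name__ == "__main__":
    # Hypothetical per-run token counts for one step (illustration only).
    runs = [8e7, 1.5e8, 3e8, 6e8, 1.2e9]
    for name, value in log_token_spread(runs).items():
        print(f"{name}: {value:.3g}")
```

Working in log space keeps one very slow run from dominating the summary, and a `log10_std` near 1 would mean runs routinely differ by about an order of magnitude in tokens.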
One other thought is we likely need to change how we think about measuring performance. Instead of average success rates, it should likely be something like an efficiency metric ($ cost/solve, or the slope of the inference curve).
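As a rough sketch of what such efficiency metrics might look like in code: cost per solve is total spend divided by the number of solves, and the "slope of the inference curve" is read here as the least-squares slope of success rate against log10(token budget). Both function names and that reading of the slope are assumptions for illustration, not an established definition.

```python
import math


def cost_per_solve(total_cost_usd: float, n_solved: int) -> float:
    """Dollars spent per successful solve; infinite if nothing was solved."""
    if n_solved == 0:
        return math.inf
    return total_cost_usd / n_solved


def inference_slope(token_budgets: list[float],
                    success_rates: list[float]) -> float:
    """Ordinary least-squares slope of success rate vs log10(token budget).
    Needs at least two distinct budgets."""
    xs = [math.log10(b) for b in token_budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(success_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, success_rates))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(cost_per_solve(total_cost_usd=50.0, n_solved=4))
    print(inference_slope([1e6, 1e7, 1e8], [0.1, 0.3, 0.5]))
```

A slope expressed per 10x increase in tokens makes models comparable across budgets, which a single success rate at one fixed budget does not.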
Another nice example of the increasing effectiveness of inference scaling on very long and hard tasks, and fast saturation on new tasks!
In Nov 2025, we changed our default budget from 10M to 100M tokens for some cyber tasks... which already seems too low.
@tmkadamcz and I started working on MirrorCode, a new long-horizon software engineering benchmark, last September. I think it’s the best benchmark for measuring AI’s ability to complete very hard (but precisely specified) software tasks—but it’s likely already saturated.
All evaluations used a 2M token budget. That is not enough. GPT-5.3 Codex jumps from 3.1h [1.7h, 6.8h] at 2M to 10.5h [2.4h, 63.5h] at 10M tokens. The error bars at 10M are wide because the benchmarks are saturating.
Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness.
🧵 (1 of 7)
This is currently my favourite way to present eval results: inference scaling curves, across model generations, split by task difficulty.
You can easily see the impact of token budgets, how performance becomes more log-linear over time, and how recent model performance on hard tasks looks like older model performance on easy tasks...
🔓 Can today’s AI agents escape sandbox environments?
Using our new benchmark, SandboxEscapeBench, we find that frontier models can reliably exploit common vulnerabilities - and that breakout capability improves as model size and inference compute increase.
Read more ⬇️