Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
At last week's developer conference, Google claimed that their newest frontier model produced an operating system from just a single prompt and ~$900 in API cost. At first glance, this seems impressive. But on closer inspection, the evidence is much thinner than the headline suggests.
Most notably:
- The "single prompt" framing suggests that the agent could do this from just a few sentences with high-level instructions. But the prompt itself is many thousands of lines long and it is unclear what instructions Google provided in the prompt (and how much effort it took to even come up with the prompt in the first place).
- There is a lot of OS code on the internet, and writing an OS is a common class project in college OS courses. Based on the information provided in Google's blog post, it is unclear to what extent the agent simply copied a well-known implementation from the internet.
- In such long-running, complex implementation tasks, it is important to understand how much human intervention was needed to help the agent achieve its goals. However, Google remains ambiguous about the level of hand-holding in this experiment. They say that "no additional guidance or corrections from a human" were necessary, yet they document instances of imposing anti-cheating mechanisms between runs.
- Many key artifacts, such as the code, the prompt, and agent logs, are unreleased. This makes it impossible for external researchers to verify these marketing claims.
- To Google's credit, they did release the overall cost and token budget. These details often remain undisclosed, and sharing them is a first step in the right direction.
More detail in our writeup at https://t.co/mItAYM00pZ
w/ @sayashk @RishiBommasani Andrew Schwartz @random_walker
Working with agents for the past few months has convinced me that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job!
In our paper "Log analysis is necessary for credible evaluation of AI agents", we
➡️introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns);
➡️outline four key principles for conducting log analysis effectively;
➡️present a case study of how log analysis helped us to find a variety of benchmarking errors on τ-bench; and
➡️give a set of recommendations to improve log analysis quality and adoption.
📄https://t.co/2xKsB4oMaU
More details in @PKirgis's thread below ⬇️
New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵
Paper: https://t.co/7GHgPenoeH
Most agentic benchmarks center around tasks that are automatically verifiable.
But any task that is verifiable is also easy to optimize for.
This work instead describes the future of critical open-world evaluations.
Led by @sayashk, our current draft is now live.
Sadly won't be at ICLR, but if you are, make sure to check out our model cascading work!
Big LLMs give great answers but they're costly. Small LLMs are fast but weaker. What if you could get the quality of the big one at the latency of the small one most of the time?
Meet CASCADIA, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving.
Also, our plan isn't fixed: it reshapes itself to the quality bar. Drop the quality target from 90 → 85 on the same trace, and Cascadia routes 21% (not 50%) to the 671B and reallocates 4 of its GPUs to the smaller models. So the same system yields a very different cascade.
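For readers new to cascades, here is a minimal sketch of the general idea: try the small model first and escalate only when its answer does not clear a quality bar. This is not Cascadia's scheduler. The confidence-threshold rule, the toy `small_model` and `large_model` functions, and the `cascade` and `escalation_rate` helpers are all assumptions made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def small_model(prompt: str) -> Tuple[str, float]:
    # Toy stand-in: pretends to be confident only on short prompts.
    confidence = 0.9 if len(prompt) < 20 else 0.6
    return f"small: {prompt}", confidence


def large_model(prompt: str) -> Tuple[str, float]:
    # Toy stand-in: always confident.
    return f"large: {prompt}", 1.0


def cascade(
    prompt: str, models: Sequence[Model], thresholds: Sequence[float]
) -> Tuple[str, int]:
    """Try models from smallest to largest.

    thresholds[i] is the bar for models[i]; the last model has no
    threshold because it is the fallback. Returns the answer and the
    index of the model that produced it.
    """
    if not models:
        raise ValueError("need at least one model")
    if len(thresholds) != len(models) - 1:
        raise ValueError("need one threshold per non-final model")
    for i, model in enumerate(models[:-1]):
        answer, confidence = model(prompt)
        if confidence >= thresholds[i]:
            return answer, i
    answer, _ = models[-1](prompt)
    return answer, len(models) - 1


def escalation_rate(
    prompts: List[str], models: Sequence[Model], thresholds: Sequence[float]
) -> float:
    """Fraction of prompts that end up at the final (largest) model."""
    if not prompts:
        return 0.0
    last = len(models) - 1
    hits = sum(1 for p in prompts if cascade(p, models, thresholds)[1] == last)
    return hits / len(prompts)


if __name__ == "__main__":
    models = [small_model, large_model]
    prompts = ["hi", "a" * 30]
    # Lowering the bar sends fewer requests to the large model.
    print(escalation_rate(prompts, models, [0.8]))
    print(escalation_rate(prompts, models, [0.5]))
```

In this toy setup, lowering the threshold reduces the share of traffic that reaches the large model, which is the same direction as the 90 → 85 example above; the actual percentages and GPU reallocation come from Cascadia's own planner, which this sketch does not model.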
Glad to be a part of this initiative to develop open-world evaluations for AI. We need the ability to assess just how capable agents are becoming in order to anticipate and respond to the impact they can have on real-world systems and transactions. An agent that can successfully act on the general instruction “build an app and get it posted in the App Store” is one that brings us closer to an economy of agents, with significant implications for how markets behave and how they need to be regulated. https://t.co/JIJ7fSydiT
Yesterday, we announced CRUX, a project that aims to conduct regular “open-world evaluations,” where we will be testing the ability of AI agents to complete long-horizon tasks in messy, real-world environments. @sayashk's post dives into the details; here are a few of my own thoughts about why this is worth doing.
This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings!
Glad the AISI SoE team could contribute to this effort.
Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.
📢📢A double launch today! We’re releasing a paper analyzing the rapidly growing trend of “open-world evaluations” for measuring frontier AI capabilities. We’re also launching a new project, CRUX (Collaborative Research for Updating AI eXpectations), an effort to regularly conduct such evaluations ourselves.
I think open-world evals are the most important development in AI evaluation over the past year. Our paper explains why we need them, what they can and can’t tell us, and how to do them well.
In CRUX #1, we tasked an agent with building and publishing a simple iOS app to the Apple App Store. The paper has many “lessons from the trenches” from running this experiment. We hope you find it interesting! CRUX #2 will be about AI R&D automation.
The core team is @sayashk, @PKirgis, @steverab, Andrew Schwartz, and me. We’re delighted to have assembled an amazing group of collaborators, many of whom have conducted important open-world evaluations: @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, and @CUdudec.
Paper: https://t.co/M15jgh4PCP
HTML version: https://t.co/iuVW7RAlr5
CRUX website: https://t.co/g937gpS65j
📈𝗕𝗲𝘆𝗼𝗻𝗱 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
A robust evaluation ecosystem requires both approaches! We still need bottom-up testing with detailed, task-level specifications. But we must pair this with top-down testing: long-running tasks that test how agents handle real-world ambiguity.