Stephan Rabanser

@steverab

Postdoctoral Researcher @Princeton. Reliable, safe, trustworthy machine learning. Previously: @UofT @VectorInst @TU_Muenchen @Google @awscloud

Princeton, NJ

Joined April 2010

383 Following

705 Followers

10.1K Posts

Pinned Tweet

Stephan Rabanser @steverab

about 2 months ago

Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.

https://t.co/LsDg0rWfvp

Stephan Rabanser @steverab

9 days ago

At last week's developer conference, Google claimed that their newest frontier model produced an operating system from just a single prompt and ~$900 in API cost. At first sight, this seems impressive. But on closer look, the evidence is much thinner than the headline suggests. Most notably:
- The "single prompt" framing suggests that the agent could do this from just a few sentences with high-level instructions. But the prompt itself is many thousands of lines long and it is unclear what instructions Google provided in the prompt (and how much effort it took to even come up with the prompt in the first place).
- There is a lot of OS code on the internet and it is often attempted as a class project in college OS classes. Based on the information provided in Google's blog post, it is unclear to what extent the agent simply copied a well-known implementation from the internet.
- In such long-running, complex implementation tasks, it is important to understand what degree of human intervention was performed to help the agent achieve its goals. However, Google remains ambiguous about the level of hand-holding they performed in this experiment. They say that "no additional guidance or corrections from a human" were necessary, yet they document instances of imposing anti-cheating mechanisms between runs.
- Many key artifacts, such as the code, the prompt, and agent logs are unreleased. This makes it impossible for external researchers to verify these marketing claims.
- To Google's credit, they did release the overall cost and token budget. These details often remain undisclosed, and sharing them is a first step in the right direction.
More detail in our writeup at https://t.co/mItAYM00pZ w/ @sayashk @RishiBommasani Andrew Schwartz @random_walker

Stephan Rabanser @steverab

21 days ago

Working with agents for the past months has me convinced that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job!
In our paper Log analysis is necessary for credible evaluation of AI agents, we
➡️introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns);
➡️outline four key principles for conducting log analysis effectively;
➡️present a case study of how log analysis helped us to find a variety of benchmarking errors on τ-bench; and
➡️give a set of recommendations to improve log analysis quality and adoption.
📄https://t.co/2xKsB4oMaU
More details in @PKirgis's thread below ⬇️

Peter Kirgis @PKirgis

22 days ago

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: https://t.co/7GHgPenoeH

steverab retweeted

Sara Hooker

@sarahookr

about 1 month ago

Most agentic benchmarks center around tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open world evaluations. Led by @sayashk, our current draft is now live.

Stephan Rabanser @steverab

about 1 month ago

Check out our poster on Sat, Apr 25, 2026 11:15 AM – 1:45 PM PDT in Pavilion 3 P3-#1625! ICLR link: https://t.co/xus1nBmajf Paper: https://t.co/7tsZ154cJf Joint work with @youheyork, Fangcheng Fu, @Renee42581826 (lead authors) and @Jintao_Zhang_, @niclane7, @Hades317.

Stephan Rabanser @steverab

about 1 month ago

Sadly won't be at ICLR, but if you are, make sure to check out our model cascading work! Big LLMs give great answers but they're costly. Small LLMs are fast but weaker. What if you could get the quality of the big one at the latency of the small one most of the time? Meet CASCADIA, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving.

Stephan Rabanser @steverab

about 1 month ago

Also, our plan isn't fixed: it reshapes itself to the quality bar. Drop the quality target from 90 → 85 on the same trace, and Cascadia routes 21% (not 50%) to the 671B and reallocates 4 of its GPUs to the smaller models. So the same system yields a very different cascade.

https://t.co/v9mba97xsj

steverab retweeted

Gillian Hadfield

@ghadfield

about 2 months ago

Glad to be a part of this initiative to develop open-world evaluations for AI. We need the ability to assess just how capable agents are becoming in order to anticipate and respond to the impact they can have on real world systems and transactions. An agent that can successfully act on the general instruction “build an app and get it posted in the App Store” is one that brings us closer to an economy of agents, with significant implications for how markets behave and need regulating https://t.co/JIJ7fSydiT

steverab retweeted

Peter Kirgis @PKirgis

about 2 months ago

Yesterday, we announced CRUX, a project that aims to conduct regular “open-world evaluations,” where we will be testing the ability of AI agents to complete long-horizon tasks in messy, real-world environments. @sayashk's post dives into the details; here are a few of my own thoughts about why this is worth doing.

steverab retweeted

Cozmin Ududec

@CUdudec

about 2 months ago

This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings! Glad the AISI SoE team could contribute to this effort.

steverab retweeted

Sayash Kapoor @sayashk

about 2 months ago

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

https://t.co/CrvbEd9l7f

steverab retweeted

Arvind Narayanan

@random_walker

about 2 months ago

📢📢A double launch today! We’re releasing a paper analyzing the rapidly growing trend of “open-world evaluations” for measuring frontier AI capabilities. We’re also launching a new project, CRUX (Collaborative Research for Updating AI eXpectations), an effort to regularly conduct such evaluations ourselves.

I think open-world evals are the most important development in AI evaluation over the past year. Our paper explains why we need them, what they can and can’t tell us, and how to do them well.

In CRUX #1, we tasked an agent with building and publishing a simple iOS app to the Apple App store. The paper has many “lessons from the trenches” from running this experiment. We hope you find it interesting! CRUX #2 will be about AI R&D automation.

The core team is @sayashk, @PKirgis, @steverab, Andrew Schwartz, and me. We’re delighted to have assembled an amazing group of collaborators, many of whom have conducted important open-world evaluations: @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, and @CUdudec.

Paper: https://t.co/M15jgh4PCP
HTML version: https://t.co/iuVW7RAlr5
CRUX website: https://t.co/g937gpS65j

Stephan Rabanser @steverab

about 2 months ago

📄Paper draft: https://t.co/SL7pINxwNV
📝Substack essay: https://t.co/Pl0zQ90kTy
🕸️Website: https://t.co/mhfHfl4knV
🪵Full agent logs: https://t.co/NL8CF9KxXA
💡Share your own CRUX ideas: https://t.co/QvJUQtnT7M
We are excited to run more instances of CRUX in the future! Grateful to have worked on this with many collaborators across academia, industry, non-profits, and government: @sayashk, @PKirgis, Andrew Schwartz, @random_walker, @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, @CUdudec!

Stephan Rabanser @steverab

about 2 months ago

📈𝗕𝗲𝘆𝗼𝗻𝗱 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀 A robust evaluation ecosystem requires both approaches! We still need bottom-up testing with detailed, task-level specifications. But we must pair this with top-down testing: long-running tasks that test how agents handle real-world ambiguity.

https://t.co/UL9ypULygn
