New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
i'll be attending @iclr_conf in Rio this week! 🇧🇷
excited to be presenting ResearchRubrics at the 4/23 Thurs afternoon poster session, and a poster on rubric-benchmark sensitivity at the Agents in the Wild workshop
Can AI do real digital work? We reviewed three benchmarks to find out: RLI, GDPval, and APEX-Agents.
Our take: progress here will indicate substantial economic value, but tasks are too self-contained to tell us about wholesale automation.
Thread for more:
TL;DR: Spreadsheet generation is multi-dimensional.
Human preference data captures what users actually value, but different dimensions matter across domains, and some signals surface more clearly than others.
Spreadsheet Arena gives us a powerful foundation for evaluation, and a new lens for improving post-training.
Start a battle at https://t.co/MHGdWX6xxm
Read the paper at https://t.co/6GX4ify8CS
@srkundurthy @claranahhh @Zachkirshner @calvincbzhang @ManasiSharma_ @jhnling
New paper from @scale_AI & @MeridianAgent: SpreadsheetArena
We evaluated 16 LLMs on end-to-end spreadsheet generation via 4,300+ blind pairwise votes.
Crucially, we move beyond scalar Elo ratings to decompose the latent preference signal into functional, structural, and stylistic components. 🧵
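For readers unfamiliar with the scalar baseline mentioned above, here is a minimal sketch of how a single Elo rating per model can be computed from blind pairwise votes. It is the textbook Elo update, not necessarily the procedure used in the paper; the function names, the K-factor of 32, the starting rating of 1000, and the model names are all illustrative assumptions.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes, k=32.0, start=1000.0):
    """Fold a sequence of (winner, loser) votes into one scalar rating per model.

    k and start are assumed defaults, not values taken from the paper.
    """
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # The winner gains what the loser gives up, scaled by how surprising the win was.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes between made-up model names.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
print(elo_ratings(votes))
```

A single number like this ranks models but cannot say whether a vote was won on formulas, layout, or styling, which is the gap the decomposition in the post is aimed at.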
In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance.
We explore what meaningful agent evaluation looks like, where today's agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.
i also recently sat down with @calvincbzhang and @Bckenstler on the CoT podcast at @scale_AI to discuss Deep Research agents & the ResearchRubrics paper in particular 🧵
we dug into (1) how to assess deep research queries & reports, (2) where today's agents still fall short, and (3) why clearer evaluation frameworks are critical as real-world agent use accelerates. :)
chatted with other folks at NeurIPS last week working on browser / CUA post-training, and it seems like people are shifting toward vision-only agents, which appear to generalize better
i'll be attending NeurIPS in San Diego next week!
excited to chat about agents in general (deep research agents, multimodal browser agents for long-horizon tasks, etc.), evals for open-ended + economically valuable tasks, and RL / post-training
ResearchRubrics is a benchmark that tests how well AI deep research agents handle open-ended web questions.
The key finding is that these agents still fall short of expert expectations once checked against rubrics.
The benchmark has 101 prompts from 9 domains, each with a human-written checklist covering content, reasoning, synthesis, citations, and clarity.
Each checklist item has a weight and is either mandatory or optional. A judge marks each item as fully met, partly met, or not met.
Each task gets 3 tags describing topic breadth, reasoning depth, and how open-ended the goal is.
Top agents stay below 68% compliance, mainly because they miss implicit context and fail to connect information across documents. Language model judges and humans agree on the grading when rubrics use concrete examples.
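The weighted checklist scoring described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact formula: the half credit for "partly met", the positive-only weights, the `RubricItem` fields, and the example rubric are all hypothetical.

```python
from dataclasses import dataclass

# Assumed credit per judge verdict; 0.5 for "partly met" is an illustrative choice.
CREDIT = {"met": 1.0, "partial": 0.5, "not_met": 0.0}


@dataclass
class RubricItem:
    description: str
    weight: float    # importance of this checklist item (hypothetical scale)
    mandatory: bool  # mandatory vs. optional item
    verdict: str     # one of the CREDIT keys, assigned by a judge


def compliance(items):
    """Weighted share of rubric credit earned, between 0 and 1."""
    total = sum(item.weight for item in items)
    if total == 0:
        raise ValueError("rubric has no weighted items")
    earned = sum(item.weight * CREDIT[item.verdict] for item in items)
    return earned / total


def missed_mandatory(items):
    """Descriptions of mandatory items that were not fully met."""
    return [i.description for i in items if i.mandatory and i.verdict != "met"]


# Made-up rubric for a single report.
rubric = [
    RubricItem("Cites primary sources", 4.0, True, "met"),
    RubricItem("Connects findings across documents", 4.0, True, "partial"),
    RubricItem("Notes limitations", 2.0, False, "not_met"),
]
print(f"compliance = {compliance(rubric):.0%}")  # compliance = 60%
print(missed_mandatory(rubric))
```

Reporting the missed mandatory items alongside the aggregate score keeps a high percentage from hiding a failure on something the rubric authors considered essential.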
----
Paper: arxiv.org/abs/2511.07685
Paper Title: "ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents"