Dawn Song

@dawnsongtweets

Professor in Computer Science at UC Berkeley, co-Director of Berkeley RDI Center; Building safe, secure, decentralized AI; Serial entrepreneur

Bay Area

Joined March 2012

830 Following

37.5K Followers

605 Posts

Pinned Tweet

Dawn Song

@dawnsongtweets

10 days ago

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case? Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work. My group and collaborators previously have created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains. With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering. Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance. On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate. The age of useful agents is here. The age of truly job-ready agents is not. We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

$dawnsongtweets's tweet photo. Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case? Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work. My group and collaborators previously have created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains. With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering. Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance. On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate. The age of useful agents is here. The age of truly job-ready agents is not. We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵$

931

195

478

263K

Dawn Song

@dawnsongtweets

10 days ago

ALE is truly a community effort. Huge thanks to a distinguished advisory committee guiding our industry landscape and task collection: @gallantlab, @thg_lab, Tarek Zohdi, Carl Boettiger & @ksteinfe (@UCBerkeley) Laure Zanna, @kaanozbay (@nyuniversity) George Em Karniadakis (@BrownUniversity) Tapio Schneider (@Caltech) @Idasim (@UCSF) Arvind Rao (@UMich) @yannakakis (@UMmalta) Patrick Bryant (@scilifelab) @yaminirangan (@HubSpot) @brad_rothenberg (@nTopology) We are also deeply grateful to @BerkeleyRDI, RDI Foundation, @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program) for their support. A huge thank you as well to our incredible organizing and execution team, and to all of the experts and contributors who donated their time, expertise, and real-world projects to make ALE possible. This simply would not have happened without you.

Dawn Song

@dawnsongtweets

10 days ago

931

195

478

263K

Dawn Song

@dawnsongtweets

10 days ago

Why "Last Exam"? The name has two meanings: "Last" as the bar to clear：passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession. "Last" as the frontier of difficulty：tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish. Come test your agent on ALE → Website: https://t.co/YjfsZILxBb Tasks: https://t.co/Pyoud9MffS Leaderboard: https://t.co/eoURCYi8Jx Paper: https://t.co/Ts2xbXcQBL Dataset: https://t.co/EBy9MpJUk1 Code: https://t.co/rcijuQ2Grn

Who to follow

Raluca Ada Popa

@ralucaadapopa

Head of Security and Privacy Research @ Google DeepMind. @UCBerkeley security professor. MIT PhD. Co-founder of @OpaqueSys, @imua & @PreVeil.

Christopher Manning

@chrmanning

Founder @stanfordnlp & cs224n—Senior Fellow @StanfordHAI—Prof. CS & Linguistics @Stanford—GP @aixventureshq—MTS @moonlake—Australian🇦🇺—Do #NLProc & #AI 👋

Tengyu Ma

@tengyuma

Assistant prof. @ Stanford; Chief AI Scientist @ MongoDB; Former Co-founder/CEO of Voyage AI Working on ML, DL, RL, LLMs, and their theory.

Dawn Song

@dawnsongtweets

12 days ago

My group & collaborators have built many of the benchmarks the field now runs on — MMLU, MATH, CyberGym, ExploitGym, etc.. I'm really excited to share our latest: Agents' Last Exam (ALE). Why "Last Exam"? The name has two meanings: "Last" as the bar to clear — passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession. "Last" as the frontier of difficulty — tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish. A few things that make ALE different: • Real work, not vibes. Every one of the 1,500+ tasks comes from real projects or research contributed by domain experts. We converted them into verifiable tests and objectively graded evaluations — no human judges required. • Built for breadth. ALE spans 55 non-physical occupations based on the O*NET / SOC 2018 occupational taxonomy, with contributions from 300+ experts across 100+ institutions. • Judged on results, no restriction on process. We evaluate Generalist Computer-Use Agents (GCUAs) with full GUI + CLI access, allowing them to solve tasks however it would — clicking, typing, scripting, browsing, and more. We just grade the outcome. Huge thanks to my postdoc @YiyouSun for spearheading this tremendous effort, and to our esteemed advisory committee, incredible team and collaborators who made it possible. We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵👇

Yiyou Sun

@YiyouSun

12 days ago

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

337

104

227

87K

108

21K

Dawn Song

@dawnsongtweets

16 days ago

Static benchmarks inevitably saturate as models improve. Our recent work explores how to evolve existing tasks into harder, validated variants that co-evolve with frontier models — and also serve as RL training data for self-improvement. A step toward benchmarks that evolve as quickly as the models they measure.

dawnsongtweets's tweet photo. Static benchmarks inevitably saturate as models improve.

Our recent work explores how to evolve existing tasks into harder, validated variants that co-evolve with frontier models — and also serve as RL training data for self-improvement.

A step toward benchmarks that evolve as quickly as the models they measure.

Yangzhen Wu

@yangzhen04

18 days ago

Static benchmarks are dying — they tend to get saturated quickly. Evaluation and training data should co-evolve with frontier models. We released BenchEvolver — a framework that automatically evolves saturated problems into harder, verified tasks for evaluating frontier models, which can also serve as useful self-improvement signals for RL. New work from UC Berkeley @berkeley_ai @BerkeleyRDI @BerkeleySky Project Page: https://t.co/PL1KpGyd87 Paper: https://t.co/gBQOXrZbAV

yangzhen04's tweet photo. Static benchmarks are dying — they tend to get saturated quickly.

Evaluation and training data should co-evolve with frontier models.

We released BenchEvolver — a framework that automatically evolves saturated problems into harder, verified tasks for evaluating frontier models, which can also serve as useful self-improvement signals for RL.

New work from UC Berkeley @berkeley_ai @BerkeleyRDI @BerkeleySky

Project Page: https://t.co/PL1KpGyd87
Paper: https://t.co/gBQOXrZbAV

40K

24K

Dawn Song

@dawnsongtweets

26 days ago

Excited to see ExploitGym used to evaluate Claude Mythos Preview’s cyber capabilities. ExploitGym tests a critical question: can AI agents turn real vulnerabilities into working exploits? ExploitGym is the first comprehensive benchmark to systematically evaluate this question, including 898 real-world exploitation tasks across userspace programs, V8, and the Linux kernel, with realistic mitigation toggles. Exploitation remains challenging, but frontier models are making significant progress, underscoring the need for rigorous, reproducible, dual-use cyber evaluations.

Anthropic

@AnthropicAI

30 days ago

Last month we launched Project Glasswing, our collaborative AI cybersecurity initiative. Since then, we and our partners have found more than ten thousand high- or critical-severity vulnerabilities in essential software.

517

652

10K

Dawn Song

@dawnsongtweets

about 1 month ago

6/ Many thanks to our wonderful coauthors and collaborators from UC Berkeley, MPI-SP, UCSB, ASU, Anthropic, OpenAI, and Google for their invaluable contributions. We also sincerely thank the support from the GLM team, as well as everyone who provided feedback and help for this work. Paper: https://t.co/fkVWJbNZvf Blog: https://t.co/ZoX6jjYDhs

Dawn Song

@dawnsongtweets

about 1 month ago

1/ Can AI agents turn security vulnerabilities into real attacks? This is one of the most critical tasks for measuring the impact of frontier AI on cybersecurity. In ExploitGym, we find that autonomous exploitation is no longer hypothetical, even on complex targets such as browser engines and the Linux kernel. How we measured this⬇️

dawnsongtweets's tweet photo. 1/ Can AI agents turn security vulnerabilities into real attacks?

This is one of the most critical tasks for measuring the impact of frontier AI on cybersecurity.

In ExploitGym, we find that autonomous exploitation is no longer hypothetical, even on complex targets such as browser engines and the Linux kernel.

How we measured this⬇️

123

21K

Dawn Song

@dawnsongtweets

about 1 month ago

5/ This is deeply dual-use. ExploitGym gives defenders, AI developers, and policymakers a rigorous way to measure frontier cyber capabilities and reason about risk.

Dawn Song

@dawnsongtweets

about 1 month ago

Excited to share DecodingTrust-Agent Platform (DTap), the first controllable, full-stack simulation platform for advanced AI agent red-teaming across 50+ realistic environments. DTap supports multiple attack vectors, including environment-, tool-, skill-, and prompt-level injections, as well as their compositions. We also build DTap-Bench, a ~7K-task benchmark with complex workflows and sophisticated attacks for evaluating agent security and utility under realistic threat scenarios. Through DTap, we uncover systematic vulnerabilities and zero-day failure modes in popular agents such as OpenClaw and Claude Code, and provide insights on how to improve harness design, tool execution, and trust calibration for more robust agentic systems. Read our paper to learn more 👇 Paper link: https://t.co/TeCeqfDsd9 Platform + benchmark + code: https://t.co/dlyeMoTba3 Great work by the team!

Zhaorun Chen

@ZRChen_AISafety

about 1 month ago

AI agents are already going wild, but today’s red-teaming tools for them are still like toys 😢 🔥👽 After spending 20 months and $120K API credits, we are excited to finally open-source DecodingTrust-Agent Platform (DTap): the first controllable, realistic simulation platform for advanced AI agent red-teaming !! 🌍 DTap simulates 50+ real-world environments across 14 high-stakes domains, with realistic agent interfaces replicated from their official MCPs and GUIs. The environments are full-stack, interactive, fully parallelizable, and can be easily configured to reproduce arbitrary real-world attack scenarios, making agent red-teaming scalable and highly transferable to deployment settings. 🔥We also release DTap-Bench, a large-scale benchmark with ~7K agent red-teaming tasks and ~4K policy-grounded malicious goals. Each red-teaming task includes a sophisticated attack sequence across environment-, tool-, skill-, prompt-level injections, as well as their compositions, plus a handcrafted verifiable judge that checks the actual consequences in the environment. Using DTap-Bench, we evaluate popular agent frameworks and backbone models across diverse policies, risks, threat models, and attack strategies, revealing systematic vulnerabilities and zero-days in today’s agents! Paper link: https://t.co/PjnGC5wKk9 Platform + benchmark + code: https://t.co/aicipKMnig Join our Discord: https://t.co/8UyRjH6RqX Read more below 👇

ZRChen_AISafety's tweet photo. AI agents are already going wild, but today’s red-teaming tools for them are still like toys 😢

🔥👽 After spending 20 months and $120K API credits, we are excited to finally open-source DecodingTrust-Agent Platform (DTap): the first controllable, realistic simulation platform for advanced AI agent red-teaming !!

🌍 DTap simulates 50+ real-world environments across 14 high-stakes domains, with realistic agent interfaces replicated from their official MCPs and GUIs. The environments are full-stack, interactive, fully parallelizable, and can be easily configured to reproduce arbitrary real-world attack scenarios, making agent red-teaming scalable and highly transferable to deployment settings.

🔥We also release DTap-Bench, a large-scale benchmark with ~7K agent red-teaming tasks and ~4K policy-grounded malicious goals.

Each red-teaming task includes a sophisticated attack sequence across environment-, tool-, skill-, prompt-level injections, as well as their compositions, plus a handcrafted verifiable judge that checks the actual consequences in the environment.

Using DTap-Bench, we evaluate popular agent frameworks and backbone models across diverse policies, risks, threat models, and attack strategies, revealing systematic vulnerabilities and zero-days in today’s agents!

Paper link: https://t.co/PjnGC5wKk9
Platform + benchmark + code: https://t.co/aicipKMnig
Join our Discord: https://t.co/8UyRjH6RqX

Read more below 👇

100

98K

17K

Dawn Song

@dawnsongtweets

about 1 month ago

Poetiq did it again: SOTA on LiveCodeBench Pro after ARC-AGI and HLE! Truly impressive work from @poetiq_ai, really showing the power of their meta-system for agentic AI!

Poetiq

@poetiq_ai

about 1 month ago

Poetiq's Meta-System built its own coding harness from scratch. It got SOTA on LiveCodeBench Pro. No fine-tuning, no special model access. Just standard APIs. Using Gemini 3.1 Pro, it made a harness that beat all frontier models we tested.

poetiq_ai's tweet photo. Poetiq's Meta-System built its own coding harness from scratch. It got SOTA on LiveCodeBench Pro.

No fine-tuning, no special model access. Just standard APIs. Using Gemini 3.1 Pro, it made a harness that beat all frontier models we tested. https://t.co/v575oUYJeH

551

236

Dawn Song

@dawnsongtweets

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users