deadline to submit tasks for Terminal-Bench 3.0 is May 31st!
the best tasks are the most interesting to measure: realistic + useful + meaningfully beyond the current frontier
any piece of valuable work done on a computer is fair game
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵
Thank you to @ekellbuch for leading TB2.1, @Zai_org for Terminal-Bench 2.0 Verified, which informed 11 of the 28 tasks we patched, and @SnorkelAI and @togethercompute for support
We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0
TB2.1 includes:
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)
The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard.
We're adding some new policies to keep it reliable:
• ATIF trajectories required for all passing trials
• Reward hacking results in reward 0 for the trial
• Cheating results in immediate leaderboard removal
Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences!
Detailed blog post in comments ⬇️
We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard.
Thank you @NoCommas for helping us keep leaderboard entries honest!
Recent leaderboard submissions are in https://t.co/q0Vf1AlR1q which makes it easy for the community to work together to detect cheating.