terminalbench @terminalbench - Twitter Profile

15 days ago

deadline to submit tasks for Terminal-Bench 3.0 is May 31st!

the best tasks are the most interesting to measure: realistic + useful + meaningfully beyond current frontier

any piece of valuable work done on a computer is fair game

terminalbench @terminalbench

15 days ago

Contribute to Terminal-Bench Science!

Steven Dillmann

@StevenDillmann

15 days ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇

https://t.co/MSPMwnbhVt

@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.

1/6🧵

terminalbench @terminalbench

29 days ago

Thank you to @ekellbuch for leading TB2.1, @Zai_org for Terminal-Bench 2.0 Verified, which informed 11 of the 28 tasks we patched, and @SnorkelAI and @togethercompute for support

terminalbench @terminalbench

29 days ago

We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0

TB2.1 includes

• recalibrated limits
• fixed solutions
• realigned verifiers

Per-task breakdowns in 🧵

We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜) https://t.co/NeNUny3v9t

terminalbench @terminalbench

29 days ago

https://t.co/wIWyJrIRLj

terminalbench retweeted

Alex Shaw

@alexgshaw

about 1 month ago

The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard. We're adding some new policies to keep it reliable:

• ATIF trajectories required for all passing trials
• Reward hacking results in reward 0 for the trial
• Cheating results in immediate leaderboard removal

Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences! Detailed blog post in comments ⬇️

terminalbench retweeted

Alex Shaw

@alexgshaw

3 months ago

We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard. Thank you @NoCommas for helping us keep leaderboard entries honest! Recent leaderboard submissions are in https://t.co/q0Vf1AlR1q which makes it easy for the community to work together to detect cheating.
