📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵
New Benchtalks with @jyangballin: on ProgramBench (frontier models scored 0% at launch) and the lineage and future of coding benchmarks, from SWE-bench/InterCode to now
01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench
CAIS AgenticSE Workshop Keynote Talk: Harbor Adapters & Harbor Index
30min video here:
https://t.co/RJvj5o7udy
Happy to be invited as a keynote speaker and present our recent study on Harbor Adapters and agentic evaluation!
🚨 stop zipping job results 🚨
... upload results to Harbor Hub instead
The hub makes it easy to share results with team members, customers, or simply save for later in a centralized place.
Example of a TB2.1 job in 🧵
Packed room to hear @alexgshaw and @ryanmart3n break down how @harborframework grew into *the* framework for RL environments.
In our RLEval workshop at @CAISconf today, attendees tackled big open challenges in RLEs & Agent Evals + I shared the approach we take at @joinHandshake
the harbor community will be @ CAIS - come say hi!
9am Tue @ RLEval workshop
Harbor & Terminal-Bench 3.0 talk by @alexgshaw / me
10:30am Tue @ RLEval workshop
OpenThoughts-Agent talk by @AlexGDimakis
4pm Tue @ Agent Software Engineering workshop
Harbor Adapters & Harbor Index talk by @LinShi592021
9am Wed: Keynote by @andykonwinski
What sets @VRubinObs apart?
Over 10 years, Rubin will be mapping the southern sky to create a comprehensive map of the cosmos. It will guide astronomers where to look next, and reveal our universe at scales previously unimaginable.
@suragnair @AnthropicAI @OpenAI @GoogleDeepMind no "guarantees" but we provide canary strings for the labs to filter out the data, and the labs that care about measuring capabilities without training on test data will (hopefully) use that
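For anyone curious what canary-string filtering could look like on the lab side, here is a minimal sketch. The canary value and the function name `drop_canaried` are made up for illustration; a real benchmark publishes its own unique string, and real data pipelines will differ.

```python
# Hypothetical canary marker, for illustration only; this is NOT the
# actual Terminal-Bench canary string.
CANARY = "BENCHMARK-CANARY-00000000-0000-0000-0000-000000000000"


def drop_canaried(documents, canary=CANARY):
    """Return only the documents that do not contain the canary string."""
    return [doc for doc in documents if canary not in doc]


corpus = [
    "ordinary web page text",
    f"benchmark task instructions ... {CANARY}",
]
clean = drop_canaried(corpus)
```

A plain substring check keeps the idea simple: any document carrying the marker is dropped before training, whatever else it contains.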
Great article by @TimothyKassis on why we need Terminal-Bench Science - if you’re a scientist and want AI agents to become better in your domain, join us!👇
https://t.co/EwINF4CogN
"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it.
https://t.co/GZ28R5QIRn
Wonderful project; wonderful people; please contribute for the sake of science.
Bonus: @StevenDillmann will be interning with me and AutoDiscovery team @allen_ai translating benefits from TB-Science to our science agents!
Terminal-Bench Science is a direct way to contribute to AI for Science. It's programming agents by task specification. Ask a precise scientific question and watch how AI agents will learn to solve it:
Step 1. Package a scientific task or workflow, something that takes a working scientist a week or a month to do, into an RL environment.
Step 2. Write tests that verify if the task has been done correctly (can be done easily if you have already solved the task manually).
Step 3. Sit back and let progress in AI agents solve it within 6 months.
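To make Step 2 concrete, here is a rough sketch of what a test for a task could look like, assuming the agent is asked to write one numeric answer to a JSON file. The file layout, the `"value"` key, the tolerance, and the function name `verify_result` are all assumptions for illustration, not the actual Terminal-Bench Science or Harbor task format.

```python
import json
import math
from pathlib import Path


def verify_result(result_path, expected, rel_tol=1e-3):
    """Return True if the agent's reported value matches the reference.

    Assumes (hypothetically) that the agent writes {"value": <number>}
    to result_path. A missing file, malformed JSON, a missing key, or a
    non-numeric value all count as failure.
    """
    path = Path(result_path)
    if not path.exists():
        return False
    try:
        reported = json.loads(path.read_text())["value"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(reported, bool) or not isinstance(reported, (int, float)):
        return False
    return math.isclose(reported, expected, rel_tol=rel_tol)
```

If you have already solved the task by hand, `expected` is just your own answer, which is why writing the check is usually the easy part. A relative tolerance is used because scientific results rarely match to the last digit.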
Scientists, I highly encourage you to submit the hard scientific tasks you want your agents to do to the Terminal-Bench Science benchmark! Get your task seen and solved by agent/model providers, and get credit from the project.