Steven Dillmann @StevenDillmann - Twitter Profile

Pinned Tweet

14 days ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 https://t.co/MSPMwnbhVt @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

16

494

112

271

903K

Steven Dillmann

@StevenDillmann

about 7 hours ago

Awesome interview on ProgramBench with @jyangballin and @vincentsunnchen - great stuff guys!

vincent sunn chen

@vincentsunnchen

about 8 hours ago

New Benchtalks with @jyangballin: on ProgramBench (0% frontier models at launch) and the lineage/future of coding benchmarks, from SWE-bench/InterCode to now

01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench

1

35

12

10

11K

2

9

2

4

3K

Steven Dillmann

@StevenDillmann

5 days ago

Check out Lin’s great talk on how we built a unified infrastructure for agentic benchmarks, and why we need it!

Lin Shi @LinShi592021

5 days ago

CAIS AgenticSE Workshop Keynote Talk: Harbor Adapters & Harbor Index

30min video here: https://t.co/RJvj5o7udy

Happy to be invited as a keynote speaker and present our recent study on Harbor Adapters and agentic evaluation!

0

25

6

11

20K

0

13

2

1

2K

Steven Dillmann

@StevenDillmann

6 days ago

@BenBlaiszik @AnthropicAI @OpenAI @GoogleDeepMind We will have 1 H100 GPU available for more compute-intensive tasks!

0

1

0

82

StevenDillmann retweeted

Harbor Framework

@harborframework

7 days ago

🚨 stop zipping job results 🚨

... upload results to Harbor Hub instead

The hub makes it easy to share results with team members, customers, or simply save for later in a centralized place.

Example of a TB2.1 job in 🧵 https://t.co/hvaAd5zIai

1

18

2

3

4K

StevenDillmann retweeted

Jonas Mueller

@jomulr

8 days ago

Packed room to hear @alexgshaw and @ryanmart3n break down how @harborframework grew into *the* framework for RL environments.

In our RLEval workshop at @CAISconf today, attendees tackled big open challenges in RLEs & Agent Evals + I shared the approach we take at @joinHandshake https://t.co/dfRbcJafg3

2

33

10

3

6K

StevenDillmann retweeted

Ryan Marten

@ryanmart3n

9 days ago

the harbor community will be @ CAIS - come say hi!

9am Tue @ RLEval workshop: Harbor & Terminal-Bench 3.0 talk by @alexgshaw / me
10:30am Tue @ RLEval workshop: OpenThoughts-Agent talk by @AlexGDimakis
4pm Tue @ Agent Software Engineering workshop: Harbor Adapters & Harbor Index talk by @LinShi592021
9am Wed: Keynote by @andykonwinski

3

22

6

2

3K

StevenDillmann retweeted

SLAC National Accelerator Laboratory @SLAClab

14 days ago

What sets @VRubinObs apart? Over 10 years, Rubin will be mapping the southern sky to create a comprehensive map of the cosmos. It will guide astronomers where to look next, and reveal our universe at scales previously unimaginable.

1

31

15

5

2K

Steven Dillmann

@StevenDillmann

12 days ago

@suragnair @AnthropicAI @OpenAI @GoogleDeepMind no "guarantees" but we provide canary strings for the labs to filter out the data, and the labs that care about measuring capabilities without training on test data will (hopefully) use that

0

1

0

36
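
For readers unfamiliar with canary strings, the reply above can be illustrated with a rough sketch of how a lab might filter benchmark files out of a training corpus. This is only an assumption about the general technique, not Terminal-Bench's actual tooling: the marker value and the function name `filter_canary` are hypothetical, and a real benchmark publishes its own unique string.

```python
# Hypothetical canary marker (placeholder value, not a real benchmark's string).
CANARY = "BENCHMARK-CANARY-00000000-0000-0000-0000-000000000000"


def filter_canary(documents, canary=CANARY):
    """Return only the documents that do not contain the canary string."""
    return [doc for doc in documents if canary not in doc]


corpus = [
    "An ordinary web page about astronomy.",
    f"benchmark task file ... {CANARY} ... reference solution",
]

clean = filter_canary(corpus)
print(len(clean))  # number of documents kept after filtering
```

As the reply notes, this only works if whoever assembles the training data chooses to run such a filter; the string itself enforces nothing.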

StevenDillmann retweeted

Stanford AI+Biomedicine Seminar @Stanford_AI_Bio

13 days ago

Wish an AI agent could handle your next research task in the list? 👇

0

4

2

6

3K

Steven Dillmann

@StevenDillmann

13 days ago

@suragnair @AnthropicAI @OpenAI @GoogleDeepMind @suragnair Great work on CompBioBench! v1 of the benchmark will be fully open as per terminal-bench standards

1

0

134

Steven Dillmann

@StevenDillmann

13 days ago

Great article by @TimothyKassis on why we need Terminal-Bench Science - if you’re a scientist and want AI agents to become better in your domain, join us!👇 https://t.co/EwINF4CogN

Timothy Kassis

@TimothyKassis

13 days ago

https://t.co/8T6njZQMYh

1

19

4

21

3K

0

10

2

1

903

StevenDillmann retweeted

Sanmi Koyejo @sanmikoyejo

14 days ago

"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it. https://t.co/GZ28R5QIRn

0

26

6

13

3K

StevenDillmann retweeted

Chaitanya K. Joshi

@chaitjo

14 days ago

Very timely initiative!!

0

5

1

3

2K

StevenDillmann retweeted

Richard C. Suwandi

@richardcsuwandi

14 days ago

Good evals like this are exactly what we need to accelerate progress in AI for science

0

10

2

3

2K

StevenDillmann retweeted

Bodhisattwa Majumder

@mbodhisattwa

14 days ago

Wonderful project; wonderful people; please contribute for the sake of science. Bonus: @StevenDillmann will be interning with me and AutoDiscovery team @allen_ai translating benefits from TB-Science to our science agents!

0

22

2

8

2K

StevenDillmann retweeted

Bespoke Labs

@bespokelabsai

14 days ago

Consider contributing tasks to Terminal-Bench Science, the most direct way to teach AI agent to solve your AI workflows and accelerate your research.

0

8

3

4

1K

StevenDillmann retweeted

Allan

@AllanatrixQ

14 days ago

More please!

0

4

1

0

313

StevenDillmann retweeted

Chenhao Tan

@ChenhaoTan

14 days ago

Science is the frontier of AI. Contribute to this initiative if you can!

0

9

1

0

2K

StevenDillmann retweeted

Alex Dimakis

@AlexGDimakis

14 days ago

Terminal-Bench Science is a direct way to contribute to AI for Science. It's programming agents by task specification. Ask a precise scientific question and watch how AI agents will learn to solve it:

Step 1. Package a scientific task or workflow, something that takes a working scientist a week or month to do into an RL environment.
Step 2. Write tests that verify if the task has been done correctly (can be done easily if you have already solved the task manually).
Step 3. Sit back and let AI agent progress solve it in 6 months.

9

42

7

20

5K
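
Step 2 in the tweet above (tests that verify the task was done correctly) might look roughly like the sketch below. It assumes the scientist already has a reference answer from solving the task by hand and that the agent reports a single number. The reference value, the tolerance, and the name `verify_result` are all hypothetical, and this is not the Harbor or Terminal-Bench task format.

```python
import math

# Hypothetical reference value obtained by solving the task manually.
REFERENCE_PERIOD_DAYS = 3.52


def verify_result(reported, reference=REFERENCE_PERIOD_DAYS, rel_tol=1e-2):
    """Pass if the agent's reported value matches the reference within rel_tol."""
    return math.isclose(reported, reference, rel_tol=rel_tol)


print(verify_result(3.53))  # close to the reference
print(verify_result(4.10))  # far from the reference
```

A tolerance-based check like this is one simple option. Real scientific tasks may need richer checks, such as comparing output files or fitted parameters.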

StevenDillmann retweeted

Leon Chen @CVPR

@realleonlc

14 days ago

Scientists, I highly encourage you to submit hard scientific tasks that you want your agents to do to this Terminal-Bench Science benchmark! Make your task seen and solved by agent/model providers. Get credit from the project.

0

5

1

649
