Kilian Lieret @KLieret - Twitter Profile

Kilian Lieret @KLieret

about 7 hours ago

John is literally the GOAT of benchmarks. Hear him talk about ProgramBench, CodeClash, and SWE-bench

vincent sunn chen

@vincentsunnchen

about 11 hours ago

Kudos to the ProgramBench team! + @KLieret (co-lead) @18jeffreyma @parth007_96 @dpedch @sten_sootla @micmylin @pengchengyin @magpie_rayhou @syhw @Diyi_Yang @OfirPress YouTube here: https://t.co/01Bsd3SxHQ

Kilian Lieret @KLieret

1 day ago

ProgramBench mascot IRL. Couldn't find a gray fox plush

Kilian Lieret @KLieret

1 day ago

@Cypher_Samurai @SarahLacard @xeophon @OfirPress @jyangballin @18jeffreyma The agent runs inside the Docker container, so there's also no user in a docker group (there's no Docker-in-Docker installed)

Kilian Lieret @KLieret

2 days ago

@SarahLacard @xeophon @OfirPress @jyangballin @18jeffreyma The agents were running as a non-root user in the container, and I haven't seen any privilege escalation hacks

KLieret retweeted

Ofir Press

@OfirPress

3 days ago

Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma

KLieret retweeted

Ofir Press

@OfirPress

3 days ago

@jyangballin Full ProgramBench Q&A: https://t.co/4oKq27BZf2 Full benchmark at https://t.co/CphwwpKzmf

KLieret retweeted

Ofir Press

@OfirPress

3 days ago

John (@jyangballin) talking about the wide behavioral differences between GPT and Claude on ProgramBench.

KLieret retweeted

Ofir Press

@OfirPress

3 days ago

Kilian (@KLieret) on why the initial 0% top scores on ProgramBench also surprised us

Kilian Lieret @KLieret

6 days ago

@Jokerinfina @jyangballin will update early next week

Kilian Lieret @KLieret

6 days ago

Some of my favorite things in the Anthropic system cards are the examples of strange model behavior. This one is on frustration in chain-of-thought reasoning (and seems to have been largely resolved). https://t.co/DSwldBvOxu

Kilian Lieret @KLieret

6 days ago

I like this specific study from the Opus 4.8 model card: how often the model is lazy and (incorrectly) guesses program behavior without actually checking it by tracing through the whole call stack. I've definitely had that happen before. https://t.co/M9hq1zdvxI

Kilian Lieret @KLieret

6 days ago

SWE-bench multimodal is still very hard (numbers from Anthropic system cards)

Kilian Lieret @KLieret

6 days ago

@stalkermustang it would still not be very fair, because we don't really know what agent setup they were running. They talk about "episodes" on the left plot, so they probably reran an agent several times. We're planning to update our leaderboard next week

Kilian Lieret @KLieret

6 days ago

Very cool to see ProgramBench scaling charts for Opus 4.8! The % hidden tests passed is not the "official" metric, but it makes sense for these studies (though I generally consider it to be misleading for the overall benchmark)