Cozmin Ududec

Fundraising at the Center for Reducing Suffering, and co-organising @sentfutures Summit London 2026. Trying to make powerful AI go less badly, for all sentients

21 days ago

Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵

Image (AISecurityInst's tweet photo): https://t.co/iudBoXys1e
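To make "doubling every few months" concrete: if task length grows exponentially, log2(task length) is linear in time, and the doubling time is the reciprocal of the fitted slope. A minimal sketch, assuming made-up data points (the months and task lengths below are hypothetical, not AISI's measurements):

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def doubling_time_months(months, task_minutes):
    """Doubling time implied by an exponential trend in task length."""
    _, slope = fit_line(months, [math.log2(t) for t in task_minutes])
    return 1.0 / slope  # slope is in doublings per month

# Hypothetical data: months since the first measurement, task length in minutes.
months = [0, 6, 12, 18, 24]
task_minutes = [2, 4, 8, 16, 32]  # doubles every 6 months by construction

print(doubling_time_months(months, task_minutes))
```

Fitting the same function to only the most recent points and comparing the result with the full-range fit is one simple way to check whether the trend is speeding up.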

CUdudec retweeted

Sayash Kapoor @sayashk

22 days ago

I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run, is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: https://t.co/jj9F3wWMo5

As benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results. Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. https://t.co/rTcirSHuRO

This builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work.

I’m sure we’ll find many other issues in our evals, but genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.

about 1 month ago

I'll be mentoring for Pivotal this summer! Apply if you're interested in personas and behaviour dynamics over long trajectories.

Pivotal Research

@pivotal_org

about 1 month ago

Language models read their own outputs as evidence for their current persona, sometimes entrenching it. Cozmin Ududec (@CUdudec) leads the Science of Evaluation team at UK AISI and is taking on Pivotal fellows to study how personas carry over, stabilise, drift, or compound across long conversations.

CUdudec retweeted

about 1 month ago

OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

CUdudec retweeted

Noam Brown

@polynoamial

about 1 month ago

A hill that I will die on: with today's AI models, intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per $. This is especially true when using it in a product like Codex.

CUdudec retweeted

Nate

@NateBurnikell

about 1 month ago

We (@AISecurityInst) tested GPT-5.5 for its cyber capabilities and safeguards. It's the strongest performing model we've tested on our narrow cyber tasks and solved one of our cyber ranges in 1/10 attempts. We found a universal jailbreak with 6 hours of expert red teaming.

Image (NateBurnikell's tweet photo): https://t.co/xXt67MBTbb

about 2 months ago

The paper and thread also have a lot of useful detail on best practices and pitfalls for running open-world evals well!

about 2 months ago

This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings! Glad the AISI SoE team could contribute to this effort.

Sayash Kapoor @sayashk

about 2 months ago

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

Image (sayashk's tweet photo): https://t.co/CrvbEd9l7f

about 2 months ago

More broadly: are there better ways to run these expensive, low-sample evaluations to get more insight efficiently? One idea is to run an episode end-to-end once, then return to an intermediate progress state, branch, and sample more heavily from that point. Could designs like this help us estimate time-horizons, inference-scaling efficiency, robustness, and harness effects?
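One way to picture the branching design is a toy simulation: run one full episode, snapshot the state at an intermediate step, then spend the remaining samples only on the suffix. Everything here is hypothetical (the `ToyEpisode` class, the per-step success probability, the step counts); a real harness would need to snapshot the actual environment and agent context.

```python
import copy
import random

class ToyEpisode:
    """Hypothetical stand-in for an agent run on a multi-step task."""
    def __init__(self, n_steps, p_step, rng):
        self.n_steps, self.p_step, self.rng = n_steps, p_step, rng
        self.step = 0  # number of steps solved so far

    def advance(self, max_attempts):
        """Attempt steps until the task is solved or the attempt budget runs out."""
        for _ in range(max_attempts):
            if self.step == self.n_steps:
                break
            if self.rng.random() < self.p_step:
                self.step += 1
        return self.step == self.n_steps

def branch_from_checkpoint(n_steps, p_step, checkpoint_step,
                           n_branches, suffix_attempts, seed=0):
    """Estimate P(finish within suffix_attempts | reached checkpoint_step)."""
    assert 1 <= checkpoint_step <= n_steps
    rng = random.Random(seed)
    trunk = ToyEpisode(n_steps, p_step, rng)
    snapshot = None
    # 1) Run one episode end-to-end, saving the state when it first reaches the checkpoint.
    while trunk.step < n_steps:
        trunk.advance(max_attempts=1)
        if snapshot is None and trunk.step == checkpoint_step:
            snapshot = copy.copy(trunk)
    # 2) Return to the snapshot and sample many continuations from it.
    solved = 0
    for _ in range(n_branches):
        branch = copy.copy(snapshot)  # shallow copy: branches share the rng, so draws differ
        solved += branch.advance(max_attempts=suffix_attempts)
    return solved / n_branches

rate = branch_from_checkpoint(n_steps=30, p_step=0.05, checkpoint_step=20,
                              n_branches=200, suffix_attempts=150)
print(f"estimated P(finish | reached step 20): {rate:.2f}")
```

The trunk is paid for once, so all 200 branch samples go toward the late, expensive part of the task. The estimate is conditional on the particular snapshot, which is a limitation of this kind of design as well as its source of efficiency.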

about 2 months ago

This growing variance in the step reached at a given budget (or in the tokens needed to reach a given step) could be a big issue for estimating performance on very long-horizon tasks at very large token budgets.

about 2 months ago

One thing I find interesting about this result is the large gap between the best run (dashed red line), and the average over 10 runs (solid heavy red line) for Mythos. At around 80M tokens, the best run is finished, while the average is still at step 20. Put another way, there is a huge variance in the random variable `log(token) to solve step n`!

about 2 months ago

We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵

Image (AISecurityInst's tweet photo): https://t.co/gd9hi0Ve55
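The gap between the best run and the average can be summarised by the spread of log(tokens) needed to reach a given step. A rough sketch, using invented per-run token counts (`tokens_to_step` is hypothetical data, not the Mythos runs):

```python
import math
import statistics

def log_token_spread(tokens_per_run):
    """Mean and sample standard deviation of log10(tokens) across runs that reached a step."""
    logs = [math.log10(t) for t in tokens_per_run]
    return statistics.mean(logs), statistics.stdev(logs)

# Hypothetical: tokens each of 5 runs needed to reach step n (not real data).
tokens_to_step = [2e6, 8e6, 3e7, 9e7, 4e8]

mean_log, sd_log = log_token_spread(tokens_to_step)
best = min(tokens_to_step)
geo_mean = 10 ** mean_log  # geometric mean of tokens across runs

print(f"best run: {best:.2e} tokens, geometric mean: {geo_mean:.2e} tokens")
print(f"std of log10(tokens): {sd_log:.2f} (a typical factor of {10 ** sd_log:.1f}x)")
```

This only covers runs that actually reached the step. Runs that ran out of budget first are censored, so with a fixed budget the true spread is likely larger than this estimate suggests.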

about 2 months ago

One other thought is we likely need to change how we think about measuring performance. Instead of average success rates, it should likely be something like an efficiency metric ($ cost/solve, or the slope of the inference curve).
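Both candidate metrics are simple to compute from per-budget results. A sketch under assumed inputs: `results` maps a token budget to (solves, attempts) and `price` is a placeholder $ per million tokens. None of these numbers come from the thread.

```python
import math

def cost_per_solve(budget_tokens, solves, attempts, usd_per_million_tokens):
    """Upper-bound $ per solve, assuming every attempt spends its full budget."""
    if solves == 0:
        return math.inf
    total_usd = attempts * budget_tokens / 1e6 * usd_per_million_tokens
    return total_usd / solves

def inference_slope(results):
    """Least-squares slope of success rate against log10(token budget)."""
    xs = [math.log10(b) for b in results]
    ys = [s / a for s, a in results.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical results: token budget -> (solves, attempts).
results = {1e6: (1, 10), 1e7: (3, 10), 1e8: (5, 10)}
price = 5.0  # hypothetical $ per million tokens

for budget, (solves, attempts) in results.items():
    usd = cost_per_solve(budget, solves, attempts, price)
    print(f"{budget:.0e} tokens: ${usd:,.0f} per solve")
print(f"slope: {inference_slope(results):.2f} success rate per 10x tokens")
```

In this made-up example the success rate rises with budget while the cost per solve rises too, which is the kind of trade-off a single average success rate hides.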

about 2 months ago

Another nice example of the increasing effectiveness of inference scaling on very long and hard tasks, and fast saturation on new tasks! In Nov 2025, we changed our default budget from 10M to 100M tokens for some cyber tasks...which already seems too little.

david rein

@idavidrein

about 2 months ago

@tmkadamcz and I started working on MirrorCode, a new long-horizon software engineering benchmark, last September. I think it’s the best benchmark for measuring AI’s ability to complete very hard (but precisely specified) software tasks—but it’s likely already saturated.

2 months ago

Interesting example of the impact of token budgets on inferred horizons!

Lyptus Research

@LyptusResearch

2 months ago

All evaluations used a 2M token budget. That is not enough. GPT-5.3 Codex jumps from 3.1h [1.7h, 6.8h] at 2M to 10.5h [2.4h, 63.5h] at 10M tokens. The error bars at 10M are wide because the benchmarks are saturating.

Image (LyptusResearch's tweet photo): https://t.co/Ugghwl4GYr

CUdudec retweeted

7vik @satvikgolechha

2 months ago

Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness. 🧵 (1 of 7)

Image (satvikgolechha's tweet photo): https://t.co/d8dDAkkd8Z

2 months ago

This is currently my favourite way to present eval results: inference scaling curves, across model generations, split by task difficulty. You can easily see the impact of token budgets, how performance becomes more log-linear over time, and how recent model performance on hard tasks looks like older model performance on easy tasks...
