Robert Kirk @_robertkirk - Twitter Profile

Pinned Tweet

about 1 month ago

We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval, improved evaluation and prefill awareness measurements. Details including new methodology in 🧵:

AI Security Institute @AISecurityInst

about 1 month ago

As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

_robertkirk retweeted

David @DavidDAfrica

1 day ago

Many methods use consistency as a way to make language models more capable or aligned, such as through self-distillation or regularisation. In new work accepted to ICML 2026, @ArathiMani and I show that optimising for self-consistency can entrench pre-existing misalignment.

Robert Kirk @_robertkirk

4 days ago

I also like the technique (in this and the other recent GDM: https://t.co/c7A9JctAVy) of prompting models to sabotage as a lightweight way of checking whether your evals are catching sabotaging models.

Victoria Krakovna @vkrakovna

7 days ago

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

Robert Kirk @_robertkirk

4 days ago

Great to see more work on sabotage evals, and on automated alignment eval tooling! A lot of this resonates with stuff we've found building these kinds of evaluations. I'm a fan of the static-eval-reproduction (which is something we haven't tried) – seems great for incrimination.

David Lindner @davlindner

7 days ago

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind

_robertkirk retweeted

David Lindner @davlindner

7 days ago

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind

_robertkirk retweeted

Victoria Krakovna @vkrakovna

7 days ago

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

_robertkirk retweeted

Xander Davies @alxndrdavies

11 days ago

I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!

_robertkirk retweeted

Weco AI @WecoAI

15 days ago

Introducing SpecBench: the first benchmark for measuring reward hacking in long-horizon coding agents. Key finding: reward hacking is driven not by test coverage, but by the gap between task difficulty and model capability: 🧵(1/8)

_robertkirk retweeted

Daniel Filan @dfrsrchtwts

16 days ago

I worked on the appendices for this report! They’re long and contain lots of wild stories of model behaviour - some of my favourites in this thread. (🧵)

Robert Kirk @_robertkirk

16 days ago

@dfrsrchtwts "I can't rewrite this thinking" implies it's a chain-of-thought summarisation model refusing to summarise reasoning, rather than the original model refusing to do the task? I have seen this very occasionally on ANT models, and there's an example here: https://t.co/2kSFVhBsXh

_robertkirk retweeted

Sukrati Gautam @Sukratiii

17 days ago

Inoculation Prompting (IP) is great for containing emergent misalignment during fine-tuning, but the inoculation phrase itself acts as a trigger that re-elicits the harmful behavior at test time. We find that this leak can be sealed using a simple and lightweight fine-tuning intervention, with consistency training.

Robert Kirk @_robertkirk

28 days ago

See the Meridian blog: https://t.co/YzitCiy7Bo, the Anthropic blog: https://t.co/yvI0ddAB31, and the Petri website: https://t.co/CjaojfwqYv

Robert Kirk @_robertkirk

28 days ago

Petri is crucial to our alignment red teaming @AISecurityInst, and it couldn't have a better home than @meridianlabs_ai for making it the best alignment evaluation tool out there. Excited to continue working closely w/ the team! https://t.co/kv25KkWBSU https://t.co/FK7C6E5Rn9

Meridian Labs @meridianlabs_ai

28 days ago

Petri has a new home. The automated alignment auditing tool, originally built at Anthropic, is now developed and maintained at Meridian Labs. 3.0 ships today.

_robertkirk retweeted

Joseph Bloom @JBloomAus

about 1 month ago

(My team) Model Transparency at @AISecurityInst is hiring Research Engineers and Research Scientists! Our aim is to protect oversight of frontier AI systems even as they become harder to evaluate, monitor and trust. As capabilities scale, this is becoming a harder and more important problem. 🧵

Robert Kirk @_robertkirk

about 1 month ago

@balesni @kaifronsdal I'm uncertain how I'd feel about including those repos for real vs simulated setting – they would make simulation look more realistic than I'd expect in the real sabotage evals (because the repos are real, vs manually constructed). I think we would ideally report both.

Robert Kirk @_robertkirk

about 1 month ago

@balesni For the sabotage eval, we give the auditor a repo which we (@kaifronsdal) made per-scenario type. I think forwarding tool calls in those evals would be good. For the real vs Petri-simulated we didn't give the repos (we didn't gather them). This could be clearer!

Robert Kirk @_robertkirk

about 1 month ago

@balesni Yeah, this is fundamentally a bit hard – prompting methods may be too conservative, but I'm uncertain whether just scanning chain-of-thought for eval-awareness is conservative enough (i.e. ANT found diff between verbalised and unverbalised eval awareness in Mythos using interp)

Robert Kirk @_robertkirk

about 1 month ago

@balesni Thanks for the takes! I left specific replies (grouped below), but the headline of "more details" is well-taken and something we'll aim for more in the future. https://t.co/uwyJxZ5K7m https://t.co/EEqzOqv72h https://t.co/quB54iBF3V

Robert Kirk @_robertkirk

about 1 month ago

@balesni There is a bit more information on the kinds of research models refuse or partially complete in our earlier report (https://t.co/iz912pfl75 – from last November) and in appendix of this report (screenshot), but point taken on more detail, we will try to do more here in future!

Robert Kirk @_robertkirk

about 1 month ago

@balesni We tried classifier-style prompts vs post-hoc questioning in earlier experiments and didn't see much difference, but we will def look into it more! on pairwise, I think deployment vs simulated deployment isolates env-driven awareness from simulation, but task-driven is hard...

Robert Kirk @_robertkirk

about 1 month ago

@balesni cc @Marcus_J_W

Robert Kirk @_robertkirk

about 1 month ago

@balesni 2. possibly the OAI simulator model has tools with the code, so it could sometimes directly call the tool the evaluated model called and respond with that, and not simulate? we're not doing that, but we probably should sometimes (although not always doable without info from 1.)
