Thomas Read @thjread - Twitter Profile

5 days ago

@UrielDolev @AsaCoopStick Notably it didn't help at all on Mythos Preview, which already did a good job at following the default prompt

1

0

28

Thomas Read @thjread

5 days ago

@UrielDolev @AsaCoopStick Not yet - I would expect similar things might help on other models, though perhaps not to the same extent. You'll need to swap out "thinking" for whatever terminology the model likes to use (e.g. OpenAI models call their CoT the "analysis channel")

2

1

0

36

Thomas Read @thjread

5 days ago

@UrielDolev @AsaCoopStick Shared the prompt here! https://t.co/ss0kOfTwj0 We didn't use the work from that paper, although it is of course very relevant if you're trying to rely on low CoT controllability in practice

2

0

1

35

thjread retweeted

Geoffrey Irving

@geoffreyirving

6 days ago

We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵

geoffreyirving's tweet photo: https://t.co/UziUGbIPdU

27

961

144

418

189K

thjread retweeted

Joseph Bloom

@JBloomAus

7 days ago

Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵

2

85

16

36

10K

Thomas Read @thjread

26 days ago

I helped write this report on oversight of AI systems and how it could degrade - it's a great overview, and a good guide to what research directions might help us maintain the level of oversight we enjoy today

AI Security Institute

@AISecurityInst

27 days ago

The safety of advanced AI systems increasingly depends on the ability to oversee them. Our new report examines today’s AI oversight landscape, finding many pathways likely to lead to its degradation.🧵

AISecurityInst's tweet photo: https://t.co/qJnYG64lWS

5

140

36

84

32K

0

2

1

0

181

thjread retweeted

Jordan Taylor @JordanTensor

26 days ago

There are a lot of pathways via which AI oversight is likely to degrade! Latent reasoning architectures, situational awareness, representational drift... We wrote a report ranking them. Here I'll go into some which worry me most 🧵

1

22

5

8

2K

thjread retweeted

Joseph Bloom

@JBloomAus

about 1 month ago

Thanks @Jack_W_Lindsey for your comments on the eval awareness work my team put out last month (primarily @thjread) - we've edited the post to add this (with permission) for those interested in eval awareness / steering https://t.co/d60u0OmVU8

0

18

2

13

1K

thjread retweeted

Joseph Bloom

@JBloomAus

about 2 months ago

(My team) Model Transparency at @AISecurityInst is hiring Research Engineers and Research Scientists! Our aim is to protect oversight of frontier AI even as they become harder to evaluate, monitor and trust. As capabilities scale, this is becoming a harder and more important problem. 🧵

6

220

20

119

31K

Thomas Read @thjread

2 months ago

@wassname Oh interesting! I hadn't thought of that but it does seem pretty plausible (like how I think SFT tends to overwrite the most recently learned things)

1

0

11

Thomas Read @thjread

2 months ago

Our key finding: "control" steering vectors, derived from contrastive pairs unrelated to alignment, can have effects as large as deliberately designed evaluation-awareness vectors. A vector about placing a book on the top vs bottom shelf had the largest effect of any we tested.

thjread's tweet photo: https://t.co/mGFcdeL51M

2

8

0

1

352
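
For readers unfamiliar with the term, here is a minimal sketch of one common way to derive a steering vector from contrastive pairs: take the difference between the mean activations on the two sides of the pairs, then add a scaled copy to the hidden states. This is an illustration only, not the method used in the work above. The function names, the toy three-dimensional "activations", and the norm-matched random direction (the kind of baseline a reply further down suggests comparing against) are all assumptions made for the example.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between the two sides of the pairs.

    Assumption: difference-of-means is used here because it is simple and
    common; the tweets do not say how the vectors were actually computed.
    """
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, scale=1.0):
    """Add the scaled vector to the hidden state at every position."""
    return [[h + scale * v for h, v in zip(row, vector)] for row in hidden]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def random_direction_like(vector, seed=0):
    """Random Gaussian direction rescaled to the same norm as `vector`."""
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in vector]
    s = norm(vector) / norm(d)
    return [x * s for x in d]


# Toy activations (hypothetical): e.g. "top shelf" vs "bottom shelf" prompts.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

vec = steering_vector(pos, neg)
hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
steered = apply_steering(hidden, vec, scale=2.0)
baseline = random_direction_like(vec)
```

In a real experiment the activations would come from a chosen layer of the model, and the same steering scale would be applied to the targeted vector, the "control" vectors and the random baseline so that their effects are comparable.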

Thomas Read @thjread

2 months ago

@wassname Wow that's great 😅 like how some OpenAI models think every eval is written by Redwood https://t.co/Gpryk3EGDe

1

0

10

Thomas Read @thjread

2 months ago

@Steven_B_Lee It would be interesting to compare to steering with genuinely random directions (not from contrastive pairs), which I would expect to have much less of an effect

0

10

Thomas Read @thjread

2 months ago

With an open-weight model that games evaluations in hand, we were able to follow the Opus 4.6 system card approach, steering GLM-5 on the Agentic Misalignment blackmail scenario.

1

7

0

338

Thomas Read @thjread

2 months ago

@Steven_B_Lee I don't think they're doing nothing! I think they actually are affecting the model's behaviour in ways that end up quite coherently affecting e.g. eval awareness, but I don't know why exactly each individual vector does what it does

1

0

11

Thomas Read @thjread

2 months ago

@wassname some points in favour of steering:
- it's able to sometimes reveal eval gaming in cases where the unsteered model never does the misaligned behaviour
- in practice it doesn't seem prone to false positives

1

0

11

Thomas Read @thjread

2 months ago

@wassname The effects of the targeted vectors do point in the right direction, and are larger on average... but I agree it's not super compelling!

1

0

12

Thomas Read @thjread

2 months ago

@Jandrade0112 Yeah - I guess if you saw serious misalignment from this sort of steering I'd still want you to investigate seriously (and not dismiss it as being "smaller than baseline"), but also we should just get better methods

0

1

Thomas Read @thjread

2 months ago

This work was done at UK AISI, advised by @JBloomAus and @BronsonSchoen
Finding evaluation gaming in GLM-5: https://t.co/KqnzgqVDzO
Steering against evaluation awareness: https://t.co/4vadBmpcdb

1

7

0

2

384

Thomas Read @thjread

2 months ago

We think it’s valuable to reproduce safety relevant phenomena and methods on open-weight models, especially with white-box methods; we think this work is a good example. Based on our results, we made a mild downward update on the amount of evidence we think steering provides.

2

9

1

481

Thomas Read

@thjread
