Josh Engels @JoshAEngels - Twitter Profile

Pinned Tweet

7 months ago

I ran LLM autoraters that trawled through the Gemini 3 Pro RL rollouts to surface weird behaviors. This worked surprisingly well! 🧵 I found that Gemini was sometimes aware of its environment and showed extreme emotions, like flipping a table: (╯°□°)╯︵ ┻━┻

JoshAEngels's tweet photo. I ran LLM autoraters that trawled through the Gemini 3 Pro RL rollouts to surface weird behaviors. This worked surprisingly well! 🧵

I found that Gemini was sometimes aware of its environment and showed extreme emotions, like flipping a table:

(╯°□°)╯︵ ┻━┻ https://t.co/XzxSK1PwIv

9

126

14

50

30K

Josh Engels @JoshAEngels

8 days ago

The obfuscation results themselves are worrying, but it’s not clear how realistic the threat model we study is: will future models actually have the same monitor knowledge as the model organisms we train? Towards this end it would be great to have a better science of SDF.

0

119

Josh Engels @JoshAEngels

8 days ago

Really cool new work with my great mentee Reilly: models that learn they’re being monitored can obfuscate their CoT! Elegantly, a model’s obfuscation success is almost perfectly predicted by its *CoT controllability*: how well the model can control its CoT. Implications in 🧵

Reilly H @ReillyHaskins02

8 days ago

Could future models learn that their CoT is being monitored and hide their reasoning to evade detection? In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵

ReillyHaskins02's tweet photo. Could future models learn that their CoT is being monitored and hide their reasoning to evade detection?

In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵 https://t.co/4ERUgsK9Z3

3

65

11

36

7K

2

7

1

975

Josh Engels @JoshAEngels

8 days ago

IMO this is evidence we should be tracking CoT controllability in frontier models and ideally trying to lower it. I’m also interested in *why* CoT controllability is higher in some models; can we root cause the data or envs that cause this and intervene on them?

1

0

281

JoshAEngels retweeted

Reilly H @ReillyHaskins02

8 days ago

Could future models learn that their CoT is being monitored and hide their reasoning to evade detection? In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵

3

65

11

36

7K

JoshAEngels retweeted

Max Kaufmann @Max_A_Kaufmann

2 months ago

Is training against the CoT always bad? RL training can lead to obfuscated CoT making it difficult to 'read an LLMs thoughts'. How can we predict when obfuscation occurs?🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!

Max_A_Kaufmann's tweet photo. Is training against the CoT always bad?

RL training can lead to obfuscated CoT making it difficult to 'read an LLMs thoughts'. How can we predict when obfuscation occurs?🤔

Our new @GoogleDeepMind paper introduces a framework to predict this before training starts! https://t.co/AxsayBQEjS

6

159

24

129

30K

JoshAEngels retweeted

Riya Tyagi @riyatyagi86

2 months ago

Has “just read the chain of thought” solved interpretability? We don’t think so, but it’s surprisingly hard to prove. Our solution: 9 hard tasks that reading the CoT does not solve. Now, let’s build stronger interp techniques!

riyatyagi86's tweet photo. Has “just read the chain of thought” solved interpretability? We don’t think so, but it’s surprisingly hard to prove.

Our solution: 9 hard tasks that reading the CoT does not solve. Now, let’s build stronger interp techniques! https://t.co/Fp0FSJ7H28

1

100

9

66

27K

JoshAEngels retweeted

Rohin Shah @rohinmshah

2 months ago

"Just read the chain of thought" is one of our best safety techniques. Why does it work? Because models can only think opaquely for a short time, long thinking must be transparent Can we quantify this? Yes! In our new paper, we show how to measure "time" for arbitrary networks.

rohinmshah's tweet photo. "Just read the chain of thought" is one of our best safety techniques. Why does it work? Because models can only think opaquely for a short time, long thinking must be transparent

Can we quantify this? Yes! In our new paper, we show how to measure "time" for arbitrary networks. https://t.co/yzhInGEpuL

13

464

59

334

35K

Josh Engels @JoshAEngels

3 months ago

Also interesting to consider if and in what way models are already modifying their CoT in various out-of-context ways based off documents they see in pretraining.

0

141

Josh Engels @JoshAEngels

3 months ago

Very cool work with my mentee Reilly: models trained on docs that say their CoT will be monitored obfuscate their CoT! This "naturally emergent CoT obfuscation" is worrying because it could unexpectedly arise in pretraining. We should maybe avoid training on AI control docs?

Reilly H @ReillyHaskins02

3 months ago

1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task..

ReillyHaskins02's tweet photo. 1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task.. https://t.co/WtfZ4qrA5A

2

53

9

32

7K

6

35

2

15

3K

JoshAEngels retweeted

Callum McDougall @calsmcdougall

3 months ago

Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source on a public GitHub, and available for study. Here's what's in it:

calsmcdougall's tweet photo. Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field.

All open source on a public GitHub, and available for study. Here's what's in it: https://t.co/fmnZNHhrKn

14

608

78

597

84K

JoshAEngels retweeted

Iván Arcuschin @IvanArcus

4 months ago

You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13

IvanArcus's tweet photo. You change one word on a loan application: the religion. The LLM rejects it.

Change it back? Approved.

The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.

We built a pipeline to find these hidden biases 🧵1/13 https://t.co/hgIZIReYb9

236

12K

2K

5K

874K

JoshAEngels retweeted

Subhash Kantamneni

@thesubhashk

4 months ago

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

thesubhashk's tweet photo. We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.

We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵 https://t.co/EzECVlVzcw

11

207

34

113

28K

Josh Engels @JoshAEngels

4 months ago

@matonski Some links of research pointed at in this thread: Do LLMs detect CoT tampering: https://t.co/zIJrImwCWU Prompted CoT obfuscation: https://t.co/42kGmNqC49 RL reward hacking mitigations: https://t.co/9Iv6BsxFQM

0

131

Josh Engels @JoshAEngels

4 months ago

A cool recent project with @matonski exploring steering language models by editing their thoughts. This is more powerful than prompting because instead of just providing an initial direction, you can actually steer the model in natural language as it thinks. Some thoughts in 🧵

Anton de la Fuente

@matonski

4 months ago

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

matonski's tweet photo. Reasoning models think before they answer. Can you steer their behavior by editing their thoughts?

We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵 https://t.co/bM499ljMLZ

4

67

6

57

19K

1

2

0

354

Josh Engels @JoshAEngels

4 months ago

@matonski Another angle that's interesting here is the distinction between inference and training time interventions. While we show that this CoT interventions are effective for reducing reward hacking, I expect models would quickly learn to ignore them if applied during RL.

1

0

177

JoshAEngels retweeted

Anton de la Fuente

@matonski

4 months ago

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

4

67

6

57

19K

Josh Engels @JoshAEngels

5 months ago

5/5: With @ArthurConmy , @JanosKramar, @bilalchughtai_, @rohinmshah, @NeelNanda5 Paper: https://t.co/ZjzveC8zZw

0

1

158

Josh Engels @JoshAEngels

5 months ago

1/5: We’ve got a cool new @GoogleDeepMind paper out on activation probing for misuse mitigation! Check it out for a bunch of techniques to make activation probes (even) better, including long-context generalizing architectures, AlphaEvolve, probe + LLM cascades, and seed-maxing.

Arthur Conmy

@ArthurConmy

5 months ago

Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵

ArthurConmy's tweet photo. Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks.

Our research has informed live deployments of probes in Gemini. 🧵 https://t.co/VekT2JKZYG

16

713

59

392

135K

1

21

1

9

2K

Josh Engels @JoshAEngels

5 months ago

4/5: More takeaways: - There's probably still a good amount of headroom to improve probes, but I do think we applied more optimization pressure here than past work. - Somewhat surprising to me: combing probes and LLMs results in classifiers that are better than both alone.

1

2

0

1

145

Josh Engels

@JoshAEngels

Last Seen Users on Sotwe

Trends for you

Most Popular Users