Roland S. Zimmermann @zimmerrol - Twitter Profile

about 1 month ago

RL assumes that LLMs explore well during training. What if they choose not to? In our new ICML paper with @GoogleDeepMind, we train LLMs that strategically resist RL capability elicitation by under-exploring. We study this threat model, called exploration hacking.

https://t.co/fDtxZmh6Fi

8

386

46

304

46K

zimmerrol retweeted

Max Kaufmann @Max_A_Kaufmann

2 months ago

Is training against the CoT always bad? RL training can lead to obfuscated CoT, making it difficult to 'read an LLM's thoughts'. How can we predict when obfuscation occurs?🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!

https://t.co/AxsayBQEjS

6

161

24

130

30K

zimmerrol retweeted

David Lindner @davlindner

5 months ago

New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring. No flashy results but lots of important details for deploying future AI agents safely!

https://t.co/GTsIyk1koL

8

93

27

63

20K

zimmerrol retweeted

Edward Grefenstette @egrefen

5 months ago

Extrajudicial killings of suspected drug dealers, bombing another country and kidnapping its leader w/o congressional approval. Yes, Maduro is a dictator, but without due process and international law being upheld, if it's Venezuela today who's to say it's not Greenland tomorrow?

1

14

4

2

2K

zimmerrol retweeted

Paul Graham

@paulg

5 months ago

I used to be able to claim that tech billionaires didn't actually do this — that they just wanted to refine their gadgets. But unfortunately in the current administration we've seen all three.

93

4K

210

246

334K

zimmerrol retweeted

Rep. Brian Fitzpatrick 🇺🇸

@RepBrianFitz

7 months ago

This is a plan that actually makes sense.

1K

7K

1K

298

482K

zimmerrol retweeted

Rohin Shah @rohinmshah

7 months ago

We built and validated an autorater that can provide a leading indicator of CoT illegibility https://t.co/0kWInlwqb7

1

41

6

18

7K

Roland S. Zimmermann @zimmerrol

7 months ago

Interested in CoT monitoring? Our latest paper could be interesting for you!

Scott Emmons @emmons_scott

7 months ago

CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵

https://t.co/B4HoFAqjQK

2

71

9

47

11K

0

5

0

1

424

zimmerrol retweeted

Lewis Ho @_lewisho

9 months ago

The latest version of our framework expands the scope of safety case reviews and includes a new harmful manipulation CCL.

0

10

2

1K

zimmerrol retweeted

David Lindner @davlindner

9 months ago

MATS is a great opportunity to start your career in AI safety! For MATS 9.0 I'll be running a research stream together with @emmons_scott, @jenner_erik and @zimmerrol. If you want to do research on AI oversight and control, apply now!

0

20

2

1K

zimmerrol retweeted

Samuel Albanie 🇬🇧

@SamuelAlbanie

10 months ago

the model card builds on foundational research on stealth and situational awareness from our team led by @MaryPhuong10 @zimmerrol @vkrakovna @davlindner, Ziyue Wang, @sarah_cogan, @AllanDafoe @_lewisho and @rohinmshah https://t.co/T8NEvsNM1x

0

11

3

2K

zimmerrol retweeted

Joschka Braun @BraunJoschka

11 months ago

Can reasoning models strategically sabotage their own reinforcement learning training by deliberately under-exploring? I’m currently exploring this question in MATS 8.0, alongside @yoenoo_ and @DamonFalck, supervised by @emmons_scott, @davlindner and @zimmerrol.

https://t.co/l6VCIzZjjn

1

29

2

4

1K

Roland S. Zimmermann @zimmerrol

11 months ago

Happy to have been part of this project! There is still so much more research to be done about understanding (in)capabilities of frontier models - important for building safe AI.

David Lindner @davlindner

11 months ago

Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵

https://t.co/ixhKV8JqZl

10

104

18

52

18K

0

8

3

0

570

zimmerrol retweeted

David Lindner @davlindner

11 months ago

Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵

10

104

18

52

18K

zimmerrol retweeted

Sundar Pichai

@sundarpichai

12 months ago

Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦 Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the Pareto frontier of cost and speed. 🚀

250

4K

439

405

1M

zimmerrol retweeted

Philipp Schmid

@_philschmid

12 months ago

Gemini 2.5 Technical Report 🧵

8

393

52

195

42K

Roland S. Zimmermann @zimmerrol

about 1 year ago

@florian_tramer @josh_vendrow To be fair, that particular puzzle is fairly obvious* if you look carefully at the given examples ;) *Not sure if color blindness can make this harder?

1

0

42

Roland S. Zimmermann @zimmerrol

over 1 year ago

@tkipf Nope, not yet. I was hoping to drop them off at the consulate (deadline is Tuesday morning), but it seems less and less likely that the documents will arrive in time :/

1

0

222

zimmerrol retweeted

Google DeepMind @GoogleDeepMind

over 1 year ago

As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → https://t.co/YwtVDqQWW9

98

498

72

81

109K

zimmerrol retweeted

Allan Dafoe

@AllanDafoe

over 1 year ago

I'm proud of GoogleDeepMind/Google's v2 update to our Frontier Safety Framework. We were the first major tech company to produce an explicit risk management framework for extreme risks, and I'm glad we are continuing to push ahead on safety best practice. https://t.co/CeXSDoTJeo

3

116

17

38

9K
