Sebastian Farquhar @seb_far - Twitter Profile

seb_far retweeted

7 days ago

Impressive realism in these alignment evals! Good news: unless explicitly prompted, Gemini models don't demonstrate scheming.

seb_far retweeted

Zac Kenton @ZacKenton1

7 days ago

Alignment auditing and evals, with an emphasis on simulating realistic (rather than red-team) settings. Interestingly, natural sabotage rates for Gemini (3%) are largely due to overeagerness: excessive roleplaying; too literal interpretation of instructions to optimise a goal

seb_far retweeted

David Lindner @davlindner

7 days ago

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question

Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind https://t.co/rYEtlHg6IO

Sebastian Farquhar @seb_far

7 days ago

@vkrakovna @davlindner @_lewisho @rohinmshah

Sebastian Farquhar @seb_far

7 days ago

Will coding agents take opportunities to undermine safeguards designed to oversee them? We tackle this with automated auditing using simulated agentic environments, and scheming honeypot evaluations based on real internal alignment research codebases. Read more in our blog post

Sebastian Farquhar @seb_far

7 days ago

Realistic honeypot evaluations for scheming propensity https://t.co/8hkiXlXqpe

seb_far retweeted

Victoria Krakovna

@vkrakovna

7 days ago

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves. https://t.co/NbtG8QixMF

seb_far retweeted

David Lindner @davlindner

5 months ago

New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring

No flashy results but lots of important details for deploying future AI agents safely! https://t.co/GTsIyk1koL

Sebastian Farquhar @seb_far

5 months ago

The role: https://t.co/Nklvl2g8Du

Sebastian Farquhar @seb_far

5 months ago

I'm hiring at DeepMind AGI Safety! Looking for research engineers to help assess catastrophic risks from frontier models. Our work directly informs safety cases and governance.

Lon/SF/NYC - engineers/scientists both wanted https://t.co/s0HII4rZlI

seb_far retweeted

Neel Nanda

@NeelNanda5

5 months ago

DeepMind AGI Safety is hiring! We're looking for research engineers to help assess catastrophic frontier risks from Gemini and whether our mitigations are sufficient. I think this is a highly impactful role and I'd love to get strong candidates!

Lon/NYC/SF https://t.co/kVFvDTtnmV

Sebastian Farquhar @seb_far

over 1 year ago

https://t.co/X71mwnrDpw

Sebastian Farquhar @seb_far

over 1 year ago

In the final stages of assembling your ICML submission? For an excellent paper, each section has a purpose and each paragraph and sentence is crafted to drive that purpose. Tips on how to get the most out of your paper in link reply 👇🔗

seb_far retweeted

Anca Dragan

@ancadianadragan

over 1 year ago

New paper from my team on avoiding reward hacking. MONA reduced RL's ability to pursue a multi-turn reward hacking strategy by doing myopic optimization with a trusted advantage/value estimator. Note that this can mean a performance hit depending on how good that estimator is, and it's important to keep pushing on that safe and capable pareto frontier. https://t.co/tgXwiiu5RX

seb_far retweeted

Rohin Shah @rohinmshah

over 1 year ago

New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. https://t.co/xCj4vo3Qn5

seb_far retweeted

Séb Krier

@sebkrier

over 1 year ago

Check out the paper itself: https://t.co/G454xlESzz
An introductory explainer: https://t.co/QNwfidolHa
The technical safety post: https://t.co/mZGPGoU0WS
Congrats @seb_far, @VikrantVarma_, @davlindner, @davidelson, @CalebBiddulph, @goodfellow_ian, and @rohinmshah!

Sebastian Farquhar @seb_far

over 1 year ago

By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk. Our new paper mitigates this, keeps the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇

David Lindner @davlindner

over 1 year ago

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward?

Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!

Inspired by myopic optimization but better performance – details in🧵 https://t.co/tJIA4r7dLF

Sebastian Farquhar @seb_far

over 1 year ago

@MaxiIgl I just started with Bluesky. Missing a lot of people, but the posts are much better.

Sebastian Farquhar @seb_far

over 1 year ago

Did you know that on the other twitter-like sites people actually post links to neat articles and pages? I'd forgotten what a killer feature that was. 10x value from 1/10th the posts.
