Joe Benton

29 days ago

Really enjoyed working on this paper with @emilaryd! We study some foundational questions in low-stakes AI control and find that SFT+RL post-training is sufficient to avoid sandbagging in many cases. One of quite a few Anthropic Fellows papers coming out very soon...

Emil Ryd @emilaryd

29 days ago

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

https://t.co/6Md3XMD6A6

27 days ago

📰 Next up: Efficiently Aligning Language Models with Online Natural Language Feedback, with @christinexye. We find that using online natural language feedback is much more efficient for making Haiku 4.5 good at writing alignment research proposals. https://t.co/WNe10RcIFu

JoeJBenton retweeted

Jan Leike

@janleike

about 2 months ago

New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits.

https://t.co/fbVpCPPtaU

3 months ago

📰 Excited to share the results of this Anthropic Fellows project! We built an automated alignment agent (A3) for safety finetuning. The project was great fun, and @jifan_zhang was an amazing collaborator to work with - excited to see what he does next!

Jifan Zhang @jifan_zhang

3 months ago

Sharing my last fellowship work at @AnthropicAI! We are open-sourcing an automated alignment agent for safety and model behavior finetuning. It expedites the timeline in fixing a model behavior issue, and alleviates the pain for researchers.

https://t.co/zuZvJgEBm1

JoeJBenton retweeted

6 months ago

We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.

https://t.co/DoskdFTJSb

JoeJBenton retweeted

6 months ago

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

7 months ago

📰 Excited to share a pair of papers on AI Control doing red team/blue team evaluations in SHADE-Arena. We develop stronger safety protocols and stronger attack policies using synthetic data. From @ChloeLough333 and @JKutasov's work during the Anthropic Fellows Program.

Chloe Loughridge

@ChloeLough333

7 months ago

What’s holding back good agent security protocols? Good sabotage agents. We find ways to improve sabotage agent performance to better test our security protocols.

https://t.co/GGrFI3JkPM

🚀Henry is leading AI Safety Research Programs

9 months ago

I'll be mentoring for this - consider applying if you want to work with me on scalable oversight, AI control or CoT monitoring (or with one of many other awesome mentors)!

@sleight_henry

9 months ago

🏁ONE WEEK LEFT to apply for an early decision for Astra🏁 If you need visa support to participate, or if you’ve applied for @matsprogram, your application deadline for Astra is Sept 26th. ⬇️We're also excited to announce new mentors across every stream! (1/4)

https://t.co/wFzGWlF0uH

12 months ago

📰 We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

12 months ago

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

https://t.co/spQHC2BNAd

about 1 year ago

This paper/website is a must-read if you're interested in working on AI control. By far the most thorough control investigation to date, with a ton of methodology insights and progress. https://t.co/G9URn5qVj1

about 1 year ago

I think @RyanPGreenblatt's conclusions here are pretty reasonable. He seems pretty much all-in on option 2 (though I think it seems important to recognize that there are many threats that this won't cover!) https://t.co/wsDnOwpKFQ

Ryan Greenblatt

@RyanPGreenblatt

about 1 year ago

IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But, my hopes for CoT monitoring are because models will have to reason a lot to end up misaligned and cause huge problems. 🧵

about 1 year ago

Was a pleasure to work with @yanda_chen_ on this paper! We find less-than-reassuring levels of faithfulness from reasoning models on tasks where their reasoning wasn't loadbearing. Plus, pure outcome-based training doesn't appear to help much.

about 1 year ago

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

https://t.co/K3MrwqUXX9

about 1 year ago

@yanda_chen_ What does this mean for CoT-based safety cases? Imo, either: 1. We need better techniques for training faithfulness, or 2. We need to recognize that CoT monitoring is only robust in cases where the CoT was *genuinely loadbearing* for causing harm.

about 1 year ago

This 80,000 Hours podcast with @bshlgrs on AI control is really good imo! Gets into a lot of the interesting technical details while being much more approachable than a lot of existing control content :) Definitely worth a listen

Rob Wiblin

@robertwiblin

about 1 year ago

Many are skeptical we’ll ever 'solve AI alignment.' If so, are we f**d? In our interview @bshlgrs argues that there are practical, non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing whether we can trust it.

Enjoy! Links and transcript below:

01:51 What's AI control and why is it hot?
10:44 Detecting human vs AI spies
18:10 How to catch AIs trying to escape
33:18 Cheapest AI control techniques
51:01 If we catch a model escaping... will we do anything?
53:39 Getting AI models to think they've already escaped
59:01 Will they be able to tell it's a setup?
1:16:13 The pitch to AI companies to do this
1:18:29 Will AIs get superhuman so fast that this is all useless?
1:30:39 Current alignment methods don't detect scheming
1:38:51 Is 'controlling' AIs kind of a dick move?
1:44:12 Could 10 safety-focused people in an AGI company do anything useful?
1:49:40 Benefits of working outside frontier AI companies
2:01:07 What other safety-related research looks best to Buck?
2:02:02 If an AI escapes, is it likely to be able to beat humanity from there?
2:09:22 Will misaligned models have to go rogue ASAP, before they're ready?

Find it in any app by searching for 80,000 Hours Podcast.

about 1 year ago

Come work with me as part of MATS! Applications for this summer's cohort are currently open: https://t.co/9Xzji5uCoC. I'll probably be supervising projects on AI control, reward hacking and/or model organisms of misalignment. Deadline to apply is April 18th.

over 1 year ago

OpenPhil have just put out an extremely broad RFP for technical AI safety research. (They're hoping to give away $40M in the next 5 months!!) Definitely worth checking out if you're interested in any of the areas below.

Max Nadeau

@MaxNadeau_

over 1 year ago

🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

https://t.co/2k5Il9zh65

JoeJBenton retweeted

over 1 year ago

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

https://t.co/PtXaK3G1OA

over 1 year ago

@SpencrGreenberg I agree that releasing large easy-to-optimize-against evaluation datasets might also accelerate capabilities progress. But most of the (imo) best TAI evals have relatively small datasets, making them hard to directly optimize against without overfitting
