rowan @rowankwang - Twitter Profile

Pinned Tweet

over 3 years ago

Announcing our new mechanistic interpretability paper!

We use causal interventions to reverse-engineer a 26-head circuit in GPT-2 small (inspired by @ch402’s circuits work)

The largest end-to-end explanation of a natural LM behavior, our circuit is localized + interpretable

🧵 https://t.co/43K4Fas4g5


rowan @rowankwang

28 days ago

The NLA paper is out!! congrats @thesubhashk @euan_ong @KitF_T for this really awesome work

Anthropic

@AnthropicAI

28 days ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.


rowan @rowankwang

about 1 month ago

IAs are a nice affordance bc:
- Once you train an IA, it can rapidly audit many finetuned variants (e.g. for finetuning API defense)
- IAs offer a way to “train on the test set” for auditing: developers can train IAs precisely to detect the behaviors they’re worried about


rowan @rowankwang

about 1 month ago

I'm really excited about introspection adapters! IAs enable LLMs to self-report behaviors they learned during finetuning. In our experiments, IAs generalize to models very OOD from their trainset, successfully auditing many model organisms from prior work.

keshav @kshenoy_

about 1 month ago

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training?

We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors.

It generalizes to detecting hidden misalignment, backdoors and safeguard removal. https://t.co/wLwcznETYr



rowan @rowankwang

3 months ago

@raphpfei @abhayesian some of the behaviors are quite concerning but not all of them (ie we have a behavior where the model really cares about inanimate objects) we've open sourced the training pipeline though so it should be easy to make MOs with more concerning behaviors as needed!


rowan @rowankwang

3 months ago

The only tools that actually improved agent performance above baseline are black-box sampling (prefill, user sampling, sampling without chat formats) and activation oracles. While white-box techniques helped with specific model organisms, most didn't improve aggregate performance


rowan @rowankwang

3 months ago

To make progress, alignment auditing research needs model organisms that competently hide concerning behaviors. Today we're releasing 56 such models with AuditBench. We've open-sourced everything. Hope others can build off this!

Abhay Sheshadri @abhayesian

3 months ago

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

https://t.co/JNShb62b8y


rowan @rowankwang

3 months ago

- We identified a "tool-to-agent" gap: tools that help in standalone settings don't always help the agent discover hidden behaviors. Scaffolding these tools can matter a lot, i.e. our best tools involve scaffolding normal black-box red-teaming techniques, like prefill


rowankwang retweeted

Subhash Kantamneni

@thesubhashk

4 months ago

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.

We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵 https://t.co/EzECVlVzcw


rowan @rowankwang

6 months ago

If you want to know how effective your alignment auditing technique is, consider testing it on this model organism!

Abhay Sheshadri @abhayesian

6 months ago

🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.


rowan @rowankwang

6 months ago

Many thanks to the external authors and collaborators whose work we build on. To name a few:

Our Harm Pressure setting is based on the one introduced here: https://t.co/D5EPTT7VsF

Our Secret Side Constraint setting is similar to: https://t.co/W9GrrW7gzr

Bartosz Cywinski @bartoszcyw

8 months ago

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!


rowan @rowankwang

6 months ago

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies.

Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best. https://t.co/sUEwwYSmaN


rowan @rowankwang

6 months ago

To support future work, we release our honesty fine-tuning and harm pressure data. Read the blog post here: https://t.co/Me1hni8117


rowankwang retweeted

Stewart Slocum

@StewartSlocum1

8 months ago

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?

In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not https://t.co/l86pWJdMut


rowan @rowankwang

9 months ago

@shreyas4_ i love u shreyas

