Announcing our new mechanistic interpretability paper!
We use causal interventions to reverse-engineer a 26-head circuit in GPT-2 small (inspired by @ch402’s circuits work)
The largest end-to-end explanation of a natural LM behavior, our circuit is localized + interpretable
🧵
New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.
IAs are a nice affordance bc:
-Once you train an IA, it can rapidly audit many finetuned variants (e.g. for finetuning API defense)
-IAs offer a way to “train on the test set” for auditing—developers can train IAs precisely to detect the behaviors they’re worried about
I'm really excited about introspection adapters!
IAs enable LLMs to self-report behaviors they learned during finetuning.
In our experiments, IAs generalize to models very OOD from their trainset, successfully auditing many model organisms from prior work.
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training?
We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors.
It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
@raphpfei@abhayesian some of the behaviors are quite concerning but not all of them (ie we have a behavior where the model really cares about inanimate objects)
we've open sourced the training pipeline though so it should be easy to make MOs with more concerning behaviors as needed!
The only tools that actually improved agent performance above baseline are black-box sampling (prefill, user sampling, sampling without chat formats) and activation oracles. While white-box techniques helped with specific model organisms, most didn't improve aggregate performance
To make progress, alignment auditing research needs model organisms that competently hide concerning behaviors.
Today we're releasing 56 such models with AuditBench.
We've open-sourced everything. Hope others can build off this!
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models.
We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
- We identified a "tool-to-agent" gap: tools that help in standalone settings don't always help the agent discover hidden behaviors. Scaffolding these tools can matter a lot, ie our best tools involves scaffolding normal black box redteaming techniques, like prefill
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.
We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective.
Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.
Many thanks to the external authors and collaborators whose work we build on. To name a few:
Our Harm Pressure setting is based on the one introduced here: https://t.co/D5EPTT7VsF
Our Secret Side Constraint setting is similar to: https://t.co/W9GrrW7gzr
Can we catch an AI hiding information from us?
To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked!
We release our models so you can test your own techniques too!
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies.
Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?
In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not