New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.
hmm i should have used a better example for the first post in that tweet. llama confabulates a bit here and people are semi-rightfully arguing that llama is just confused, not lying.
here's a better one where llama's reasoning is more clear:
https://t.co/zvfhy4ohzB
An average person can't look at a CT scan and identify cancer, but radiologists can.
An average person can't look at Llama's model activations and identify lying, but Natural Language Autoencoders sometimes can.
Here, an activation verbalizer shows Llama planning to lie. 🧵
Researchers can use the Neuronpedia interactive interface here: https://t.co/obViVrtTSC
And we’ve provided an annotated walkthrough: https://t.co/LLy54TFGbZ
This project was led by participants in our Anthropic Fellows program, in collaboration with Decode Research.
Announcement: we're open sourcing Neuronpedia! 🚀
This includes all our mech interp tools: the interpretability API, steering, UI, inference, autointerp, search, plus 4 TB of data - cited by 35+ research papers and used by 50+ write-ups.
What you can do with OSS Neuronpedia: 🧵
Neuronpedia now hosts Chain-of-Thought! Steer and inspect Deepseek-R1-Distill-Llama-8B with SAEs trained by @Open_MOSS on @neuronpedia (linked below). One fun initial result: the model can easily be steered into "overthinking/anxious" mode with a single latent.
Gemma Scope allows us to study how features evolve throughout the model and interact to create more complex ones.
Want to learn more? Here’s an interactive demo made by @neuronpedia - no coding necessary ↓ https://t.co/PpbYk0ujWd
Want to learn more? @neuronpedia have made a gorgeous interactive demo walking you through what Sparse Autoencoders are, and what Gemma Scope can do.
If this could happen pre-launch, I'm excited to see what the community will do with Gemma Scope now!
https://t.co/UuSLGLT7ug
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research.
Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
exciting new research from @apolloaisafety and @jordantensor: E2E SAEs (w/ ~700k features) are now live on @neuronpedia - the first to use dual UMAPs for visual comparison and exploration between SAE training methods.
check it out at https://t.co/w6CCHMxC18
Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor!
⤵️
https://t.co/ZkiW7XFPqe
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Our SAEs explain significantly more performance than before!
1/
Terrific work by @saprmarks and team! 🥳
We really enjoyed working with them to get their Sparse Autoencoders onto @neuronpedia.
You can explore, search, and test their 622,594 features here: https://t.co/k5FJ5V3vX1
Can we understand & edit unanticipated mechanisms in LMs?
We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller
1/ Introducing Neuronpedia: an open platform for interpretability research with hosting, visualizations, and tooling for Sparse Autoencoders (SAEs).
Let's try it out! ➡️
Neuronpedia lets us instantly test activations of SAE features with custom text. Here's a Star Wars feature:
5/ Thanks to @JBloomAus for support, @NeelNanda5 for TransformerLens, @ch402 and @nickcammarata for inspiration from OpenAI Microscope, and William Saunders for Neuron Viewer.
It's time to accelerate (interpretability research). 🚀🔬
https://t.co/Ty08dKe2XL
Super impressed by @johnnylin's Interactive Interface for exploring my GPT2 Small SAE Features. https://t.co/fI9t3r3eZk.
First 5000 for each layer are there with the rest coming shortly! We've updated the feature-activation highlighting to better show multiple fires per context!