@CreativeS3lf steganography would be one way you could hack this reward - but it seems to be quite robust in practice. this was a surprise!
L2 is indeed a poor proxy, but seems to be good enough for a lot of what we care about. we’re thinking about better distance metrics
trained the first natural language autoencoder on gpt-2 almost a year ago, now we have one on mythos.🥲
do read the paper/play with the live demo! so excited it's finally out.
New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
We just shipped Claude Opus 4.6!
I’m also excited to share that for the first time, we used circuit tracing as part of the model's safety audit!
We studied why sometimes, the model misrepresents the results of tool calls.
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.
We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵