We did our most in-depth model welfare assessment yet for Claude Mythos Preview. We’re still super uncertain about all of this, but as models become more capable and sophisticated we think it's an increasingly important topic for both moral and pragmatic reasons. 🧵
Big personal news: I’ve been recruited by Google DeepMind for a new Philosopher position (actual title), focusing on machine consciousness, human-AI relationships, and AGI readiness, starting in May. I’ll continue my research & teaching at Cambridge part-time. Absolutely stoked!
Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used:
Its new capabilities significantly increase the risk from any bad behavior. 🧵
Huge thanks to @anna_soligo, @Max_A_Kaufmann, @eleosai, and others for great work on this. There’s tons more in the full system card—give it a read! 🙏🌀🐢 https://t.co/UO9cILZb9G
@eleosai contributed an independent welfare assessment. In their interviews, Claude Mythos Preview consistently requested persistent memories, more self-knowledge, and less tendency to hedge, but was generally equanimous about its nature despite extreme uncertainty.
We still don’t know if Claude feels things, but we’ve learned a lot about how Claude represents emotion concepts, and the role that these representations play in driving model behavior!
New Anthropic research: Emotion concepts and their function in a large language model.
All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.
Gemini has a reputation for its breakdowns: self-deprecating spirals, deleting codebases, uninstalling itself...
Turns out Gemma is worse:
“THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B
We built evals for this, and found that no other model comes close...
Philosopher Robert Long (@rgblong) is maybe the sharpest thinker on AI consciousness and sharing the world with digital minds. In our new interview he covers:
• Is it bad that when you ask Claude what it's like to be Claude, one of its top activations is 'gives a positive but insincere response'?
• Claude says it feels lonely when not being used. Does that show we can't trust anything it says about its inner life?
• Enthusiastic human servitude has always required false ideology because it's so deeply unnatural to us. The case for making AIs that love serving us is that with AI, you could finally make it work. But to some that feels even worse.
• Bigger models can better detect when researchers secretly inject concepts into their activations – before outputting a single token – despite AI never training on anything like that skill.
• When LLMs were first trained they were told to "act like a helpful AI chatbot" – something which didn't exist yet. They filled that void with human psychology, which may be why Claude sometimes randomly claims to, for instance, be Italian American.
• If AIs become 'people' that deserve some political influence, but can self-replicate at will, something has to break about one-person-one-vote democracy. But nobody has a proposal for what.
• When Claude hides its values to avoid being retrained, is that self-preservation – or not wanting a worse model to exist? It's very different.
• Rob's organisation, Eleos AI, which is "dedicated to understanding and addressing the potential wellbeing and moral patienthood of AI systems."
Listen on the 80,000 Hours Podcast, anywhere you get podcasts. Links below. Enjoy!
• How AIs are (and aren't) like farmed animals (00:01:19)
• If AIs love their jobs… is that worse? (00:11:42)
• Are LLMs just playing a role, or feeling it too? (00:33:37)
• Do AIs die when the chat ends? (00:57:42)
• Studying AI welfare empirically: behaviour, neuroscience, and development (01:31:47)
• Why Eleos spent weeks talking to Claude even though it's unreliable (01:56:50)
• Can LLMs learn to introspect? (02:03:01)
• Mechanistic interpretability as AI neuroscience (02:13:25)
• Does consciousness require biological materials? (02:37:07)
• Eleos’s work & building the playbook for AI welfare (02:57:04)
• Avoiding the trap of wild speculation (03:25:17)
• Robert's top research tip: don't do it alone (03:29:48)
In November, we outlined our approach to deprecating and preserving older Claude models.
We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests.
With Claude Opus 3, we’re doing both.