The goblins might be coming for Claude.
Not literally.
But @OpenAI just took the kind of methodologically transparent post @Anthropic has been writing for five years — about reward hacking, mech interp, why models do weird things — and made it viral, charming, and self-aware.
Wrote about what that means. 🧌🧵