Proud to partner with @CommonSense to help develop a rigorous science around youth AI safety. Millions of kids are already using AI every day, and our understanding of these systems and their impacts has to catch up.
https://t.co/Ot7uoH3qlO
New blog post: "Building Technology to Drive AI Governance". I argue that many governance challenges are fundamentally bottlenecked by technical gaps, and consider case studies from other fields (food safety, climate change) that illustrate this dynamic.
Why does GPT-5.1 Codex score 6.5% worse than GPT-5 Codex on Terminal-Bench, with the same scaffold? 🧵
GPT-5.1 times out at ~2x the rate of GPT-5. Excluding timeouts, GPT-5.1 wins by 7.2%. We analyzed 256M+ tokens of traces and found this in under an hour. Here’s how 👇
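A rough sketch of how a timeout gap can flip a comparison like this. All counts below are made up for illustration and are not the Terminal-Bench results; `RunStats`, `model_a`, and `model_b` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Outcome counts for one model on a benchmark (hypothetical data)."""
    passed: int     # tasks completed successfully
    failed: int     # tasks that finished but failed
    timed_out: int  # tasks cut off by the time limit

    def pass_rate(self) -> float:
        # Headline score: timeouts count as failures.
        return self.passed / (self.passed + self.failed + self.timed_out)

    def pass_rate_excluding_timeouts(self) -> float:
        # Score over tasks that actually ran to completion.
        return self.passed / (self.passed + self.failed)

    def timeout_rate(self) -> float:
        return self.timed_out / (self.passed + self.failed + self.timed_out)


# Illustrative counts only: model_b times out twice as often as model_a.
model_a = RunStats(passed=50, failed=40, timed_out=10)
model_b = RunStats(passed=45, failed=35, timed_out=20)

for name, stats in [("model_a", model_a), ("model_b", model_b)]:
    print(
        f"{name}: overall={stats.pass_rate():.3f} "
        f"excl_timeouts={stats.pass_rate_excluding_timeouts():.3f} "
        f"timeout_rate={stats.timeout_rate():.2f}"
    )
```

With these made-up counts, model_b trails on the headline score but leads once timed-out tasks are dropped, which is the same shape as the gap described above.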
We're hiring a Governance & Policy Fellow to help define how independent AI evaluation works in practice—setting standards, supporting mental health evals, and supporting government evaluators. Hybrid technical + policy background, $200K–$300K. Link in replies.
our circuit tracing codebase from this project is public now! https://t.co/w7ieuPgcpn
please try it out and ping me if you have any questions 😄 and expect more updates soon!
I admire the folks at Transluce a lot. They're super smart and have a good model for how to do useful AI oversight work without being embedded in (read: beholden to) any big AI labs. Read their stuff and consider supporting!
Transluce is a top-tier AI safety research lab - I follow their work as closely as work from our own safety teams at Anthropic. They're also well-positioned to become a strong third-party auditor for AI labs.
Consider donating if you're interested in helping them out!
All @TransluceAI work that I described in my NeurIPS mech interp workshop keynote is now out! ✨
Today we released Predictive Concept Decoders, led by @vvhuang_
Paper: https://t.co/fhAK9VozDZ
Blog: https://t.co/53t4oenA1N
And here's @damichoi95's work on scalably extracting latent representations of users from model internals: https://t.co/F8fs7rhaX7
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior.
Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
Chat with a live version of our PCD at https://t.co/hCnfYwtPq6. Try testing whether the decoder can accurately predict Llama-3.1-8B’s behavior, and check whether the decoder’s response is consistent with the encoder’s active concepts!