Excited to share that our work on Dynamic Safety Monitoring for Language Models is accepted at ICLR 2026!! Looking forward to chatting with people there :)
Thanks a lot to @philiptorr @ioannispatras @Adel_Bibi @FazlBarez !!
How can we efficiently monitor LLMs for safety? Strong monitors waste compute on easy inputs, but lightweight probes risk missing harms ⚠️
𝙏𝙧𝙪𝙣𝙘𝙖𝙩𝙚𝙙 𝙥𝙤𝙡𝙮𝙣𝙤𝙢𝙞𝙖𝙡 𝙘𝙡𝙖𝙨𝙨𝙞𝙛𝙞𝙚𝙧𝙨 (TPCs) address this by generalizing linear probes for dynamic monitoring! 💫
I had a very fun and wide-ranging conversation with @campbellclaret and @RoryStewartUK for The Rest is Politics: Leading, including around what European countries should do to avoid disempowerment in the face of AI progress.
One of the most lively conversations I've had in many years! Link below.
I’ll be in Brazil for ICLR! 🇧🇷
I’ll be talking about how we can use theory to interpret models during the Thursday morning poster session and afternoon oral session! (Oral in 201 A/B, P4-#4006)
Happy to talk about interp, theory, or other things! Send a DM!
1/ AI agents are increasingly powerful. Security has not yet caught up.
New from CNAS: our response to CAISI’s RFI on AI Agent Security, with @janet_e_egan and @CalebWithersDC. 🧵
Excited to share our recent work selected as an ICLR Oral!
We work towards answering how models learn to associate tokens and build semantic concepts. We find that early-stage features in attention-based models can be written as compositions of three basis features.
Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks.
Our research has informed live deployments of probes in Gemini. 🧵
🚨New AI Safety Course @aims_oxford!
I’m thrilled to launch a new course called AI Safety & Alignment (AISAA), covering the foundations & frontier research of making advanced AI systems safe and aligned, at @UniofOxford
what to expect 👇
https://t.co/r9YHS3XJhR
How can we efficiently monitor LLMs for safety? Strong monitors waste compute on easy inputs, but lightweight probes risk missing harms ⚠️
𝙏𝙧𝙪𝙣𝙘𝙖𝙩𝙚𝙙 𝙥𝙤𝙡𝙮𝙣𝙤𝙢𝙞𝙖𝙡 𝙘𝙡𝙖𝙨𝙨𝙞𝙛𝙞𝙚𝙧𝙨 (TPCs) address this by generalizing linear probes for dynamic monitoring! 💫
Please find many more results on 4 LLMs (across base models, instruction-tuned models, and reasoning models), and ablations in the paper!
📰 Project: https://t.co/urC7CPTZAO
💻 Code: https://t.co/1nBL6XXP3w
📄 Paper: https://t.co/HJvyxldF26