My best interview in some time.
Rohin Shah leads AGI alignment/safety at DeepMind.
And he has a lot of spicy personal takes:
We probably won’t get catastrophic misalignment (00:49)
Safety 'commitments' have severe limitations (10:38)
The intelligence explosion probably isn't imminent (1:52:44)
Why he's not working to pause AI advances (51:44)
Pre-deployment evals aren't the right focus (for catastrophic risks) (37:41)
Signalling concern for safety sometimes diverts resources from actually making AI safe (1:09:51)
Reading AI thoughts is v useful for safety – and we'll probably be able to for years to come (54:17)
Governance is somewhat more likely to be the bottleneck than alignment (43:55)
Rohin's team doesn't have a veto, and that's OK (27:36)
Central banks are a promising model for regulating AI (33:34)
Also:
Google DeepMind's actual plan for building AGI safely (1:40:29)
How external researchers can positively influence big AI companies (2:21:55)
The roles GDM most needs to hire for (2:37:03)
On the 80,000 Hours Podcast. Links below - enjoy! (@rohinmshah)
New preprint: Codec-Robust Attacks on Audio LLMs
#CodecAttack
Lossy codecs (Opus, MP3, AAC) have been treated as a defense against adversarial audio. We show they're actually an attack surface.
Why does it survive?
The latent perturbation concentrates 88% of energy below 4 kHz, exactly where codecs allocate the most bits. A Jacobian analysis confirms this is structural: the decoder has no basis functions above 4 kHz.
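To make the "88% of energy below 4 kHz" figure concrete, here is a minimal sketch of how such a band-energy fraction could be measured. It is not the preprint's code: the function name `low_band_energy_fraction`, the naive DFT, and the synthetic two-tone "perturbation" are all assumptions made for illustration.

```python
import cmath
import math


def low_band_energy_fraction(signal, sample_rate, cutoff_hz):
    """Fraction of spectral energy in bins strictly below cutoff_hz (naive DFT)."""
    n = len(signal)
    half = n // 2 + 1  # non-negative frequency bins of a real signal
    low = total = 0.0
    for k in range(half):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        energy = abs(coeff) ** 2
        # Interior bins have a mirrored negative-frequency twin, so count them twice.
        if 0 < k < n - k:
            energy *= 2
        total += energy
        if k * sample_rate / n < cutoff_hz:
            low += energy
    return low / total if total else 0.0


# Hypothetical toy perturbation (not data from the paper):
# a 1 kHz tone at amplitude 1.0 plus a 6 kHz tone at amplitude 0.5.
sample_rate = 16000
n = 320  # 50 Hz per bin, so both tones fall exactly on a bin
perturbation = [
    math.sin(2 * math.pi * 1000 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 6000 * t / sample_rate)
    for t in range(n)
]
fraction = low_band_energy_fraction(perturbation, sample_rate, cutoff_hz=4000)
```

For this toy signal the fraction comes out at about 0.8, since energy scales with amplitude squared (1.0 versus 0.25). On a real perturbation one would likely swap the naive DFT for an FFT and average over frames.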
We still listen to old songs not because they are the best recordings, but because they remind us of something. A place, a person, a feeling. There is usually something imperfect about them, and I think that imperfection is part of why they stay with us.
My daily research is in AI security, but I have also been interested in a different kind of threat lately. Not a technical one, but a cultural one. I keep asking myself: what happens when more of the music, art, and stories around us are AI-generated? Not whether they will be good or bad, but whether they will carry the same weight over time. My recent blog post explores that question through the lens of why imperfection matters, how it connects to memory, and what we might quietly lose if it disappears.
It is a highly opinionated piece, not a research paper. Just a casual read. But it has been on my mind for a while and I wanted to share.
After supervising 20+ papers, I have highly opinionated views on writing great ML papers. When I entered the field I found this all frustratingly opaque.
So I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS
Thanks for sharing! We explored a similar direction in our prior work "Bob's Confetti" where we use phonetically similar lyrics to regurgitate copyrighted music at inference time. As you mentioned, training-time attacks would be a cool next step!
Paper: https://t.co/WWC6pHnIkA
Demo: https://t.co/Fexc76t34j
@anmgoel Thanks for sharing this interesting work; I'm also curious about this problem. Evaluating privacy in the audio modality can differ from text, and I wonder how BFT might affect this task as well.
7/ Good news: two simple defenses bring JSR back to near-zero.
🛡️ Distant filtering (training time): pick benign samples farthest from harmful embeddings
🛡️ System prompt (inference time): just tell the model to refuse
Safety is fragile, but recoverable.
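The distant-filtering defense above could be sketched roughly as follows. This is an illustration only: the thread does not say which distance is used, so cosine similarity to the mean harmful embedding is an assumption here, and `distant_filter`, `cosine_similarity`, and the 2-D toy embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def distant_filter(benign, harmful, keep):
    """Return indices of the `keep` benign samples least similar to the harmful centroid.

    Assumption: "farthest" means lowest cosine similarity to the mean harmful
    embedding; the original work may define distance differently.
    """
    dim = len(harmful[0])
    centroid = [sum(vec[i] for vec in harmful) / len(harmful) for i in range(dim)]
    ranked = sorted(
        range(len(benign)), key=lambda i: cosine_similarity(benign[i], centroid)
    )
    return ranked[:keep]


# Toy 2-D embeddings, made up for illustration.
harmful = [[1.0, 0.0], [0.9, 0.1]]
benign = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
selected = distant_filter(benign, harmful, keep=2)
```

Here `selected` holds the indices of the two benign samples that point away from the harmful cluster; only those would be kept for fine-tuning. In practice the embeddings would come from whatever encoder the training pipeline already uses.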