Almost forgot to share — last month, I defended my thesis, with distinction!
Feeling deeply grateful for the learnings, collaborations and friendships along the way.
New chapter at @ETH_AI_Center 🚀
🔊 Not to miss: last month @anna_hedstroem defended her PhD “Evaluation-centric advances in neural model interpretability” at TU Berlin, with distinction! ✨🧠💻☕️
Here’s a thread on a selection of Anna’s evaluation-centric interpretability work and what comes next. 🧵
Happy to share that our PRISM paper has been accepted at #NeurIPS2025 🎉
In this work, we introduce a multi-concept feature description framework that can identify and score polysemantic features.
📄 Paper: https://t.co/7HE1JGhnvD
#NeurIPS #MechInterp #XAI
🎉 Huge congratulations to @kirill_bykov, the very first PhD student of our lab, who successfully defended his thesis “Explaining Representations in Deep Neural Networks” this Monday with summa cum laude! 👏
🧵 In the next tweets, we’ll highlight some of his key works:
My brilliant co-author @salim_amk0 is presenting our work on Mechanistic Error Reduction with Abstention (MERA) now at ICML in Vancouver! 🚀
If you’re at ICML, come by East Exhibition Hall A-B, E-2605 at 4:30 pm (Vancouver, BC).
We’d love to hear what you think!
🚀 I'll be presenting our #ICML paper this afternoon!
You’ve probably heard of Mechanistic Steering, the idea of modifying internal activations of a language model at inference-time (e.g., adding a vector) to influence its behaviour, often for alignment.
But we take a different angle:
👉 We use it for error reduction.
If you've explored this space, you know it’s full of heuristics: Which vector to use? How long should it be? When to steer at all?
🎯 In our work, we bring principled answers to these questions, with provable guarantees. We introduce MERA (Mechanistic Error Reduction with Abstention for Language Models), a method for reducing errors in LLMs at inference-time by:
✅ Steering only when necessary
✅ Adapting how much to steer
✅ Abstaining unless confident of an improvement
And the best part? MERA is modular. You can plug it into any existing steering method to make it more effective and safer.
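To make the three checkmarks above concrete, here is a minimal sketch of conditional steering in plain Python. It is not the MERA algorithm from the paper: the function name `steer_with_abstention`, the `toy_probe` error estimator, the threshold, and the strength grid are all hypothetical choices made for illustration. The sketch also assumes that abstaining means leaving the activation untouched.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def steer_with_abstention(
    activation: Vector,
    direction: Vector,
    predicted_error: Callable[[Vector], float],
    error_threshold: float = 0.5,
    max_strength: float = 4.0,
    steps: int = 8,
) -> Tuple[Vector, str]:
    """Return (activation, action); action is 'unchanged', 'steered' or 'abstained'."""
    # Steer only when necessary: skip steering if the (hypothetical)
    # error probe already predicts an acceptable error.
    if predicted_error(activation) <= error_threshold:
        return activation, "unchanged"

    # Adapt how much to steer: try increasing strengths and keep the
    # smallest one that brings the predicted error under the threshold.
    for i in range(1, steps + 1):
        strength = max_strength * i / steps
        candidate = [a + strength * d for a, d in zip(activation, direction)]
        if predicted_error(candidate) <= error_threshold:
            return candidate, "steered"

    # Abstain: no tested strength met the threshold, so the original
    # activation is returned untouched (an assumption of this sketch).
    return activation, "abstained"


def toy_probe(h: Vector) -> float:
    """Hypothetical error estimate: lower when the first coordinate is larger."""
    return max(0.0, 1.0 - h[0])


if __name__ == "__main__":
    print(steer_with_abstention([0.0, 1.0], [1.0, 0.0], toy_probe))
    print(steer_with_abstention([0.0, 1.0], [-1.0, 0.0], toy_probe))
```

In a real model the activation would be a hidden state at some layer and the probe would be trained on labelled errors. Here both are toy stand-ins so the control flow is easy to follow.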
📍 Catch me at @icmlconf
📌 Poster Location: East Exhibition Hall A-B, E-2605 at 4:30 pm.
🧠 Paper: https://t.co/cRWLsqqXp3
Big thanks to my amazing co-authors: @anna_hedstroem, @tom_bewley, Saumitra Mishra, and Manuela Veloso.
#ICML2025 #LLMs #MechanisticSteering #InferenceTime #LLMSafety #ResponsibleAI #TrustworthyAI #AIResearch
Couldn’t be more excited to share our latest paper — accepted to ICML 2025 @icmlconf — with JP Morgan AI Research.
It explores a simple question:
To safely and effectively mitigate errors post-training, when (and how much) should we steer large language models?
🧵
4/ What’s fascinating is not just the outcome but how concepts like "error" show up inside LLMs.
This opens the door to more general forms of lightweight, post-training control — we're curious where else MERA may help.
Paper https://t.co/KRzyzGWelP
https://t.co/G64xYT3fp6
🔍 When do neurons encode multiple concepts?
We introduce PRISM, a framework for extracting multi-concept feature descriptions to better understand polysemanticity.
📄 Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
https://t.co/7HE1JGhnvD
🧵
If you're at #AAAI2025 don't miss our poster today (alignment track)!
Paper 📘: https://t.co/1kDjrX3OaM
Code 👩‍💻: https://t.co/SiVfRWRhx0
Teamwork with @eirasf and @Marina_MCV
At 12:30 I'll be happy to take questions about our poster presentation at #AAAI2025. Is your explanation for a model's prediction better than the alternatives? "Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution" introduces QGE... 1/4
I couldn’t be more proud and happy to share that our paper was also awarded a survey certification for an “exceptionally thorough/insightful survey” of interpretability evaluation.
Grateful to my brilliant co-authors @BommerPhiline @tfburns @SLapuschkin @WojciechSamek @Marina_MCV
Our recently accepted TMLR paper has been awarded:
🔥 Survey certification 🔥
"For an exceptionally thorough or insightful survey of interpretability evaluation."
📖 Read: https://t.co/o2BYsQ0V15
💻 Code: https://t.co/MDUzlyWNi5
Our new paper is out!
"Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions"
📖 Read: https://t.co/kOO1MNG7MF
💻 Code: https://t.co/yqo7k0IBld
Thanks to my best collaborators
@BommerPhiline @tfburns @SLapuschkin @WojciechSamek @Marina_MCV
🚨 New paper alert! 🚨
We’re excited to share our latest work on interpretability evaluation:
"Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions"
📜 Accepted at TMLR 🎉
🔥 Survey certification 🔥
📖 Read: https://t.co/o2BYsQ0V15