How can a language model find the veggies in a menu?
New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.
Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
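For readers who have not used it, here is a minimal sketch of what "filter" does in Python, applied to the veggie example. The menu items, their category labels, and the `is_veggie` helper are invented for illustration; this shows the symbolic operation the post alludes to, not the mechanism found inside the model.

```python
# Hypothetical menu: (item, category) pairs invented for this example.
menu = [
    ("carrot", "vegetable"),
    ("salmon", "fish"),
    ("spinach", "vegetable"),
    ("apple", "fruit"),
]


def is_veggie(entry):
    """Predicate: True when the entry's category is 'vegetable'."""
    _name, category = entry
    return category == "vegetable"


# filter() applies the predicate to every entry and keeps the ones that pass,
# preserving the original order of the list.
veggies = [name for name, _category in filter(is_veggie, menu)]
print(veggies)
```

The predicate is written independently of the list it is applied to, which is the property that makes the comparison to functional programming interesting.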
The 2026 New England Mechanistic Interpretability (NEMI) Workshop will be Aug. 14 at Boston University!
Help spread the word and join the New England mech interp community! Registration and submission info in thread:
Can you tell when an AI model is lying?
Announcing Aletheia's Quest, an AI lie detection challenge running this summer, organized by @cadenza_labs and @ndif_team.
Multiple model organisms to interrogate and probe, $50K prize pool, no local GPU required.
Super excited to be attending @iclr_conf in Rio.
Stop by our poster tomorrow morning (10:30am - 1:00pm) in Pavilion 4 (P4-#4001) to learn about list-processing mechanisms in LMs.
DMs are open. Please reach out if you want to meet up!
Can you solve this algebra puzzle? 🧩
cb=c, ac=b, ab=?
A small transformer can learn to solve problems like this!
And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found: 🧵⬇️
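One way to check the puzzle by hand is to brute-force it. The sketch below assumes the letters are distinct elements of a cyclic group (integers under addition mod n); the post does not say which algebraic structure is used, so that choice and the function name `solve_puzzle` are assumptions made for illustration.

```python
from itertools import product


def solve_puzzle(n):
    """Return the set of letters that 'ab' can equal, over all assignments of
    a, b, c to distinct elements of Z_n (addition mod n, an assumed structure)
    that satisfy cb = c and ac = b."""
    answers = set()
    for a, b, c in product(range(n), repeat=3):
        if len({a, b, c}) < 3:
            continue  # assume different letters name different elements
        # "xy" is read as the group operation, here addition mod n.
        if (c + b) % n == c and (a + c) % n == b:
            result = (a + b) % n
            names = {a: "a", b: "b", c: "c"}
            answers.add(names.get(result, "other"))
    return answers


print(solve_puzzle(5))
```

Under this assumption, cb = c forces b to be the identity, so ab = a for every consistent assignment, and the only way to see that is from the surrounding equations rather than from the letters themselves.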
Can models understand each other's reasoning?
When Model A explains its Chain-of-Thought (CoT), do Models B, C, and D interpret it the same way?
Our new preprint with @davidbau and @csinva explores CoT generalizability 🧵
(1/7)
At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today.
Here is a blog post summarizing the talk:
https://t.co/LSwBf9XQzE
I am very excited to share that our paper, "One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models" will be presented at #NeurIPS2025!
@ViaSurkov is presenting it at #MexIPS2025:
If you are attending NeurIPS in Mexico City, please stop by!
Date: Thursday, Dec 4, 2025
Time: 11:00 AM – 2:00 PM PST
Location: Foyer (Mexico City Poster Session)
Come visit @ViaSurkov: it's his first conference, and he will be happy to explain his amazing work.
Sadly, #NeurIPS2025 does not allow for parallel presentation in San Diego. However, I am in San Diego and happy to meet up / chat. Please don't hesitate to reach out here or via [email protected].
Once again, a big shout out to our brilliant students Viacheslav Surkov and Antonio Mari who did phenomenal work here and pushed this work (that started as a class project more than a year ago) all the way to pass the high threshold of #NeurIPS2025.
Also, I want to thank https://t.co/lXSt28RIh1 (@andyarditi and @ryan_kidd44 in particular) for helping us to finance Viacheslav Surkov's conference trip.
Please find more information about our work below. We have so many amazing interactive materials (e.g., 3x huggingface demo spaces) for you to check out. Most of our implementations are open-sourced (RIEBench on FLUX, which we added to our appendix during the NeurIPS rebuttal, is currently missing, but we plan to add it ASAP).
Me demoing the demo attached.
A key challenge for interpretability agents is knowing when they've understood enough to stop experimenting.
Our @NeurIPSConf paper introduces a self-reflective agent that measures the reliability of its own explanations and stops once its understanding of models has converged.
Thanks to my collaborators @giordanoprogers, @NatalieShapira, and @davidbau.
Check out our paper for more details:
https://t.co/A7cEMQlK7O
https://t.co/kiwYl9UOHv
https://t.co/70UsQLGyn9
How can a language model find the veggies in a menu?
New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.
Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
The fact that the neural mechanisms implemented in the transformer architecture align with human-designed symbolic strategies suggests that certain computational patterns arise naturally from task demands rather than from specific architectural constraints.