Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/9
Benchmarks can be superficial, but model explanations and evaluations are fundamentally intertwined. What if we used interpretability as principled, scientific evaluation? If it met scientific standards?
https://t.co/vsI1jgKlQF
coming to @evaluatingevals at ACL as oral 🧵
1/6
Copying → morphology/translation → basic arithmetic → complex reasoning & math. Across every model family we tested, LLMs acquire skills in roughly the same order during pretraining.
Can we use this to predict what a model will learn next, just from its internals? 🧵
1/ (New paper!)
If swapping the gender in an input prompt makes the AI model give a different answer it means that it has to have a gender bias, right? Wrong.
🧵on counterfactual prompting for LLM evals:
Paper: https://t.co/i3Zc0UlyFF
@StephenLCasper They definitely missed proper evals to be able to ensure something like this would work... my collaborators and I have been working in this space to understand faithfulness issues (see: https://t.co/an0hvJtxYm which was accepted to ICML) but they seem to have glossed over it?
@zhuokaiz There's a few other evals that we show in the Appendix of the paper too that also basically that this lack of verbalizing "privileged" knowledge persists across task types. But without enforcing this privileged constraint, you don't know which model's knowledge you're using
@zhuokaiz Yeah exactly, a lot of these works in this domain (including many of the subsequent works in verbalization incl. Activation Oracles + this white paper) seem to ignore the fact that the verbalizer is an LLM itself, and it's obvious the eval they do should reflect this notion.
Patients ask LLMs medical questions, but how they phrase it matters more than it should.
Our new preprint explores how different phrasings of patient health questions can lead to inconsistent conclusions, even with the same evidence. [1/6]
Full Paper: https://t.co/CPhz94eAfc
Midtraining is a new part of many training pipelines, but when does it help and can it backfire? 🤔
In our new preprint, we use controlled experiments to pin this down. TL;DR; midtraining helps the most when it “bridges” pretraining and posttraining, and mitigates forgetting after posttraining. Timing is also very important.
🧵
Can you solve this algebra puzzle? 🧩
cb=c, ac=b, ab=?
A small transformer can learn to solve problems like this!
And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️
Can models understand each other's reasoning? 🤔
When Model A explains its Chain-of-Thought (CoT) , do Models B, C, and D interpret it the same way?
Our new preprint with @davidbau and @csinva explores CoT generalizability 🧵👇
(1/7)
Come check out our #NeurIPS2025 paper “Do Automatic Factuality Metrics Measure Factuality?” on Friday!
We systematically investigate this question and find some surprising results 👇🧵
💻 Paper/Code/Blog: https://t.co/zHnEN1gXRq
Work w/ @byron_c_wallace
@CFGeek We actually showed some of the same pitfalls that you mentioned in our current paper on analyzing existing activation verbalization methods paper: https://t.co/an0hvJtxYm. There are possibly ways to avoid input structure, but we find it too easy to get the input information.
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).
We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/9
@saprmarks@nsaphra@byron_c_wallace I agree there might be some statistical priors that influence the likelihood of seeing the fake names, but I think most of all we would like to show how the nature of evaluations drastically affects whether you're able to extract information that you intend to for verbalization.