We don’t always know what problems are hard for LLMs. So devs evaluate on tasks HUMANS find hard or on broad benchmarks. What if we could instead anticipate which scenarios a model will fail on—all without evaluating specific input examples?
🧵NEW PAPER by @jenniferlumeng &al
really excited to head home for icml:) and attending the co-located @farairesearch's alignment workshop (for the first time)! would love to meet others interested in training & interpretability
Benchmarks can be superficial, but model explanations and evaluations are fundamentally intertwined. What if we used interpretability as principled, scientific evaluation? If it met scientific standards?
https://t.co/vsI1jgKlQF
coming to @evaluatingevals at ACL as oral 🧵
1/6
work w/ @_emliu@cathy__jiao@BrihiJ@DaniYogatama@FazlBarez@m2saxon
since am headed home for icml, it'll be presented by the amazing @BrihiJ!
this was my first time writing a position paper, which turned into a grant, which i'm turning into multiple projects 🙂 stay tuned
6/6
🗣️ Prediction, Explanation, or Over-interpretation?
Recent work suggests LLMs can verbalize information about latent states and future generations. But training of different verbalization methods varies.
Are they verbalizing, or are we over-interpreting from the explanation?
1/n
In film, "we'll fix it in post" is what you say when something went wrong on set and you don't want to redo it. AI research has made it our entire methodology: train the model, then patch whatever comes out. Our new ICML oral argues this can't be the basis of a science of AI. 🧵
check out @_emliu and our work on pretraining interp! we initially asked if we can predict from simple task learning, can we predict a mode complex learning behavior? super excited for follow-ups as well:)
Copying → morphology/translation → basic arithmetic → complex reasoning & math. Across every model family we tested, LLMs acquire skills in roughly the same order during pretraining.
Can we use this to predict what a model will learn next, just from its internals? 🧵
Midtraining is a new part of many training pipelines, but when does it help and can it backfire? 🤔
In our new preprint, we use controlled experiments to pin this down. TL;DR; midtraining helps the most when it “bridges” pretraining and posttraining, and mitigates forgetting after posttraining. Timing is also very important.
🧵
Our report from the Actionable Interpretability workshop is finally public! Some of my favorite scientists argued for hours and this is what they agreed on.
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
Excited to share our new dataset, FOL-Traces!
We introduce a large-scale dataset of programmatically verified FOL reasoning traces for studying structured logical inference + process fidelity
Happy to hear thoughts from others working on reasoning in LLMs
Check it out here 👇
paper: https://t.co/CewtwTQZOG
dataset: https://t.co/NaznvGFCT9
work w/ @liaw_sarah and @DaniYogatama
If you want to chat about interpretability & training dynamics & reasoning and munch on mezzes, come hang out with me in Rabat 🇲🇦🙃
9/9
New dataset 🗂️ coming to #eacl
What is (correct) reasoning in LLMs? How do you rigorously define/measure process fidelity? How might we study its acquisition in large scale training? We made a gigantic, verifiably correct reasoning traces of first order logic expressions!
1/9
I wanted to study reasoning acquisition in training by complexity + process fidelity but wasn't able to find a dataset. So we built one that's rigorously annotated and large enough to train a small LM. Now I’m excited about what we can do with it
8/9