Neural networks might speak English, but they think in shapes.
Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision.
Starting today, we’re releasing a series of posts on this research agenda. 🧵
We're hosting a happy hour at ICML on Wednesday, July 8!
Come connect with members of the Goodfire team and learn about our work on neural geometry and our other recent publications.
Note that space is limited, and we’re prioritizing attendees who are actively engaged in relevant AI research areas.
Link to register in the thread!
Happy to see our work cited in the Claude Fable & Mythos system card!
Steering against eval awareness can carry confounds (e.g. making the model more friendly). Interpretability can help us understand these confounds, and it is a promising source of new methods for handling eval awareness.
Have you debugged your training data? You might not like what you find.
Introducing predictive data debugging: reveal and shape what your model will learn before training.
In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
@Sauers_ To each their own! (On the other hand, we'd bet the Olmo team didn't intend for this to make up such a significant cluster of their DPO data.)
If you train models on preference data, you have a curriculum you've never read. Predictive data debugging lets you read it, understand it, and rewrite it. We've built it into Silico, our platform for model design.
Request access to Silico here: https://t.co/vnb9zRjmty
(9/9)