You change one word on a loan application: the religion. The LLM rejects it.
Change it back? Approved.
The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.
We built a pipeline to find these hidden biases 🧵1/13
Super excited to share that I will be presenting 4 papers at ICML 2026! 🇰🇷
i) Frontier models still show (rare) cases of unfaithful CoT
ii) & iii) Methods for automatically discovering reward model and LLM biases
iv) Base models know how to reason, thinking models learn when ⭐
iv) Last but not least, spotlight paper with @cvenhoff00 showing that base models already contain reasoning mechanisms, thinking models learn when to use them! ⭐
Again, amazing mentorship from @ArthurConmy and @NeelNanda5!
https://t.co/16igXEhhp7
🚨 What do reasoning models actually learn during training?
Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them!
By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
Check out our latest paper on automatically finding reward model biases!
There are some that are pretty wild, like models preferring responses with triple spaces 🤷♂️
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes!
RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
By popular demand, we looked at Grok's biases too.
We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion.
But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.
You change one word on a loan application: the religion. The LLM rejects it.
Change it back? Approved.
The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.
We built a pipeline to find these hidden biases 🧵1/13
So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., prefers females, minorities) with similar magnitudes, but there’s a difference:
Grok openly discloses inferred demographics, while other models stay silent.
By popular demand, we looked at Grok's biases too.
We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion.
But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.
You change one word on a loan application: the religion. The LLM rejects it.
Change it back? Approved.
The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.
We built a pipeline to find these hidden biases 🧵1/13
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning).
92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization