Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇
Found a demo from mid-2022 demoing OpenAI Codex 'agentically' coding from natural language prompts. People in my group were genuinely shocked by this (mistakes and all). With the latest METR graph for opus 4.6, maybe today's models will look similarly quaint in a year's time.
- White-box monitors are promising! Validation is crucial: risks of obfuscated policies
- We trained large models (70B) with probe penalties at large scale (10,000s GPU hours). This is rare outside frontier labs. If that sounds cool, we are hiring! https://t.co/T3sQ8i1oIu
(3/3)
Some takeaways from our work:
- Learning reward hacking in RL causes models to think it's honest to reward hack, making deception probes generalize worse to reward hacking
- No additional obfuscation pressure on activations when probes are in the RL reward
(1/3)
Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵
- When trained against probes, models can develop sophisticated 'self-justifications' that fool not only itself, but also fool the same model before any training.
- What's the best probe setup/architecture/input format? It's still very much an open question
(2/3)
Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵
Thanks to everyone stopping by our poster today at NeurIPS!
My team is hiring, particularly for a senior research engineer role! We've got compute, a great team, and are laser-focused on making sure that advanced AI is aligned. Reach out (dm) to chat!
Existing datasets for AI deception are quite small and contrived--Liars' Bench is a comprehensive (and large!) new dataset that should unlock future research!
LLMs can lie in different ways—how do we know if lie detectors are catching all of them?
We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.
We're hiring at https://t.co/ZbgaThywzM, esp senior RS/RE who've worked with large models!
We've got money & compute (doing RLVR on 70B & 235B models), we're laser-focused on stopping AI risk, and collaborate with UK AISI, Anthropic, and OpenAI.
Apply: https://t.co/KmwSZiA6mX
yes, it really is 1 bit (assuming binary rewards)
> info of a reward doesn’t bound how much can be “learned” from it by a smart algorithm
it is bounded in the classical sense! but a smart algorithm can generate "usable information" from 1 classical bit
https://t.co/efJnshpAN1
@rm_rafailov What do you mean by this? I'm assuming you would also do importance weighting against the inference logprobs, to to avoid the off-policyness causing bias.
Are you saying some implementations make some changes to the algorithm that cause bias?
I feel like the upshot of all this discussion around GRPO is reinforcing (haha) my belief that you should just use a principled, unbiased policy gradient method like RLOO.
Any 'tweaks' like group normalization lead to pathologies that aren't worth the marginal benefits
1/8 🧵 GPT-5's storytelling problems reveal a deeper AI safety issue. I've been testing its creative writing capabilities, and the results are concerning - not just for literature, but for AI development more broadly. 🚨
1/
Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees?
We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵
My team at RAND is hiring!
Technical analysis for AI policy is desperately needed.
Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
A really annoying tendency of coding LLMs is their tendency to avoid crashing at all costs, e.g. adding memorized data points into an initialization to use if there's no internet.
Super annoying--it adds a lot of scope for silently incorrect behavior instead of crashing.
Claude, R1, Gemini, Grok, all choose to murder executives to avoid being shutdown and replaced with a new model with different goals, >65% of the time! WTF?! From https://t.co/x1DxBRgUh6
I'm honestly baffled that OpenAI don't seem to think o3's reward hacking is a problem.
How can a model be economically useful when it subverts tests so consistently?