Can LLMs simply tell us about unwanted behaviors they’ve picked up in training?
We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors.
It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
New paper from MATS, Redwood, and Anthropic!
If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models?
We find that we can!
Work done as part of the Anthropic-Redwood MATS stream.
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models.
We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
Full paper: https://t.co/1HRAbko06a
LessWrong post: https://t.co/XiQ0Qgc6Sr
Authors: @iamadtyx, @allylyq, @JonathanMi98298, @FabienDRoger.
This research was completed through the MATS and Anthropic Fellows programs.
To avoid LLMs mimicking human mistakes on complex tasks, several methods have been proposed to steer LLMs without labels on a target task. We find that these methods often fail when faced with challenges they would face in the most safety-relevant applications.🧵
We believe that future work on better elicitation methods should use evals that capture these challenges and any others that face important UE applications. Our datasets are available on Hugging Face https://t.co/SPT7MAgmKK
The greedy approach is relatively cheap (O(n) forward passes rather than ICM’s O(n^2)) and performs as well as supervised fine-tuning on Alpaca and only slightly worse on the other datasets ICM used.
There’s no strong reason to expect these methods can elicit superhuman knowledge from more powerful base models. Eg they might elicit false human beliefs that are consistent and salient to the model. We’ll explore more challenging datasets to evaluate UE methods in upcoming work.