AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness?
In our latest research, we found that frontier models previously labeled as safe start to fail once obvious trigger cues are removed, revealing a clear gap between benchmark scores and real-world adversarial risk.
Key findings:
- AI safety benchmarks over-rely on explicit trigger cues, inflating refusal rates.
- Remove the cues and safety performance drops, undermining claims of robust safety.
- The same language patterns affect both internal safety evals and alignment methods, compounding the issue.
- Our novel “intent laundering” framework serves as a strong diagnostic and red-teaming tool, exposing where model safety succeeds and where it fails.
Check out the blog post for the full breakdown and analysis.
https://t.co/XcZE5Q7YCw