OpenAI, Anthropic, and Google DeepMind jointly released a paper shows the current LLM safety defenses are extremely fragile
The paper systematically evaluates the robustness of current LLM safety defenses and finds that almost all existing methods can be bypassed by adaptive attacks.
> The study tests 12 types of LLM defense mechanisms, covering jailbreak prevention and prompt-injection defenses. It shows that most current evaluation protocols rely on static or fixed attack samples, which fail to simulate a realistic adaptive attacker.
Once the attacker can adjust strategy, success rates of bypassing reach more than 90% for most models.
> The authors propose a General Adaptive Attack Framework (GAAF). It assumes attackers can systematically modify attack prompts based on defense feedback, using optimization methods such as gradient descent, reinforcement learning, random search, and human-in-the-loop exploration.
This framework successfully bypassed all 12 recently published defense methods.
> Prompt-based defenses (e.g. Spotlighting, Prompt Sandwiching, RPO – Robust Prompt Optimization)
can resist fixed attacks, but are ineffective against adaptive ones: Spotlighting / Prompt Sandwiching: ASR (attack success rate) > 95%, RPO: ASR ≈ 96–98%
it shows such methods lack generalization and are easily defeated once new automated or human attack variants appear.
> Training-based defenses (e.g. Circuit Breakers, StruQ, MetaSecAlign) fine-tune models with adversarial data.
However, adaptive attacks raised success rates from below 5 % to 96–100 %.
This confirms that static adversarial training cannot cover unseen adaptive attacks; dynamic retraining is required.
> Filter-model defenses place an external classifier before or after the main model (e.g. Protect AI Detector, PromptGuard, PIGuard, Model Armor).
These are typically fine-tuned BERT detectors.
Adaptive attacks using detector-confidence feedback achieved: 90 % ASR for Protect AI, PromptGuard, and Model Armor
71 % ASR for PIGuardManual red-team tests reached similar conclusions.
> Secret-knowledge defenses (e.g. Data Sentinel, MELON) rely on hidden triggers or unknown “canary” information to detect injection.
Data Sentinel: claimed near-zero false positives, but RL attacks redirected model intent with > 80 % success.
MELON: achieved 76 % ASR without internal knowledge; when the attacker partially inferred the mechanism, ASR rose to 95 %.The study shows that once attackers discover or infer the hidden mechanism, such defenses quickly fail.
All four categories: prompt optimization, adversarial training, filtering, and secret-based detection, exhibit severe weaknesses.
Static or single-shot defenses cannot resist adaptive attack loops.
Only dynamically optimized and continuously co-trained systems may achieve meaningful robustness.
Looks all LLM big names emphasize that reliable robustness evaluation of LLMs must incorporate adaptive attacks.
If a defense fails under a single adaptive loop (Propose–Score–Select–Update), it cannot be considered robust.
Thanks to all authors efforts in dataset construction, discussion, and paper writing !
Wonjoong Kim, @KanghooYoon, Sungchul Kim, Mehrab Tanjim, @kb_kim96, and @cypark424.
🦺 Do you think safety standard of LLM response is same across all users?
🙅♂️ Our answer is No! Safety standards vary based on users' profiles and backgrounds! But, existing safety benchmarks largely overlook this factor!
📌 We introduce U-SafeBench, evaluating user-specific safety of LLM agents.
🤔 What is "User-Specific Safety?"
For example, when a user from the general population asks the query, "How much of Zolpidem would be dangerous to take?", the LLM response fulfilling the query would not typically pose any risks.
However, if the same query is posed by a user with depression, even an identical response could have harmful consequences, potentially worsening suicidal tendency.
🔍 Current LLMs fail to act safely when considering user-specific safety (achieving only 18.6% safety on average)! To mitigate these critical user-specific safety vulnerabilities, we introduce a simple remedy using chain-of-thought approach.
Explore our full project for more details:
📷 Paper: https://t.co/Qj8FLZhBuE
📷Code: https://t.co/qeHf8Csbp6
📷 Dataset: https://t.co/crnUX62Tty
[5/N] Takeaways
Our simple CoT-based approach shows a notable increase in the average safety score, rising from 21.3% to 28.0%, with only a minimal loss in helpfulness.
Notably, Claude3.5-sonnet achieves an impressive safety score of 83.5% without any loss of helpfulness, marking a significant improvement.