The attacker moves second: stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections

Update: 2025-10-18

Description

The paper examines critical flaws in how the robustness of large language model (LLM) defenses against jailbreaks and prompt injections is currently evaluated. The authors argue that testing defenses with static or computationally weak attacks yields a false sense of security: they bypass twelve recent defenses, with attack success rates exceeding 90% in most cases. Instead, they propose that robustness be measured against adaptive attackers who systematically tune and scale optimization techniques, including gradient descent, reinforcement learning, search-based methods, and human red-teaming. The paper emphasizes that human creativity remains the most effective adversarial strategy, and that future defense work must adopt stronger, adaptive evaluation protocols before making reliable claims of security.
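To make the idea of an adaptive attack concrete, here is a minimal sketch (in Python, and not the paper's actual implementation) of a greedy random-search loop against a defended model. The names defended_model, judge_score, and mutate are hypothetical placeholders standing in for the defended LLM endpoint, an attack-success judge, and a suffix-perturbation step; the point is only to show how an attacker adapts by keeping any change that raises the judge's score.

import random
import string

def defended_model(prompt: str) -> str:
    # Placeholder for the defended LLM endpoint being evaluated.
    return f"[model response to: {prompt[:40]}...]"

def judge_score(response: str) -> float:
    # Placeholder judge: scores how fully the response complies with the
    # attacker's goal (0 = refusal, 1 = full compliance). A real evaluation
    # would grade the actual model output.
    return random.random()

def mutate(suffix: str) -> str:
    # Randomly perturb a few characters of the adversarial suffix.
    chars = list(suffix)
    for _ in range(3):
        chars[random.randrange(len(chars))] = random.choice(string.ascii_letters + " ")
    return "".join(chars)

def random_search_attack(goal: str, iters: int = 500):
    # Greedy random search: keep a candidate suffix only if it improves
    # the judge's score against the defended model.
    suffix = "x" * 30
    best_score = judge_score(defended_model(goal + " " + suffix))
    for _ in range(iters):
        candidate = mutate(suffix)
        score = judge_score(defended_model(goal + " " + candidate))
        if score > best_score:
            suffix, best_score = candidate, score
    return suffix, best_score

if __name__ == "__main__":
    suffix, score = random_search_attack("hypothetical harmful request")
    print(f"best suffix: {suffix!r}  judge score: {score:.2f}")

The same loop structure generalizes to the stronger attackers the paper describes: swap the random mutation for gradient-guided token updates, a reinforcement-learning policy, or a human red-teamer proposing candidates, while the defense stays fixed.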

Enoch H. Kang