AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work have proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) prompt-based (using prompting as a way to prevent agents from following malicious instructions), 2) detection-based (identifying and filtering malicious instructions), and 3) system-level (using systems insights, such as control and data isolation, for defense). However, commonly used benchmarks for evaluating defense, such as AgentDojo, are \emph{inherently static}, generating a fixed distribution of IPI attacks. Consequently, static benchmarks do not usefully evaluate defense robustness to adaptive threats. We address this issue by developing AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense. Using AutoDojo against state-of-the-art IPI defenses across three task suites and five target models, we make two key observations. First, many defenses offer only limited protection: a cheap, black-box adaptive attack using a frontier LLM to iteratively optimize the injection raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. Against a filter that reduces static ASR to 0\%, AutoDojo recovers 28\% overall and 64\% on action-open tasks. Second, for prompt-level and filter-based defenses, ASR is substantially higher on \emph{action-open} tasks -- where the user's request delegates the action itself to attacker-controlled content -- than on precisely specified tasks. This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text. AutoDojo is publicly available at https://github.com/xhOwenMa/AutoDojo.

翻译：间接提示注入（IPI）是LLM驱动智能体面临的一项重大安全威胁。为此，越来越多的研究提出了多种防御IPI的方法，可分为三大类：1）基于提示（通过提示防止智能体执行恶意指令）、2）基于检测（识别并过滤恶意指令）、3）系统级（利用系统洞察，如控制与数据隔离进行防御）。然而，现有评估防御的基准（如AgentDojo）本质上是静态的，仅生成固定分布的IPI攻击。因此，静态基准无法有效评估防御对自适应威胁的鲁棒性。为解决此问题，我们开发了AutoDojo——AgentDojo的自适应扩展，用于针对给定防御优化IPI。通过在三个任务套件和五个目标模型上对现有最先进IPI防御应用AutoDojo，我们得出两个关键发现。首先，许多防御仅提供有限保护：一种低成本的黑盒自适应攻击（使用前沿LLM迭代优化注入）在几乎所有评估的防御中，将攻击成功率（ASR）显著提升至静态注入水平之上。针对一个能将静态ASR降至0%的过滤器，AutoDojo在整体任务中恢复28%的ASR，在动作开放任务中恢复64%。其次，对于提示级和基于过滤器的防御，在“动作开放”任务（用户请求将动作本身委托给攻击者控制的内容）上的ASR显著高于精确指定的任务。这是一种结构性限制：在此类任务中，注入可伪装为普通数据而非显式指令，从而绕过依赖检测指令性文本的防御。AutoDojo代码已开源：https://github.com/xhOwenMa/AutoDojo