Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rates (ASR) on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defenses are confined to recognizing specific attack patterns rather than assessing whether the intent of each embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, so the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions in the user task rather than in attacker-controlled data. At each tool-output step, the defender performs chain-of-thought reasoning to verify that its actions are consistent with the user task. For red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, which yields broad coverage of injection-reformulation strategies. Together, these components allow the defender to be optimized via multi-objective reinforcement learning and to achieve a better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASRs of 2.92% and 3.75% on the two target models, while preserving most utility both under attack and on clean inputs.
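To make the reported metrics concrete, the following minimal sketch shows one way per-attack ASR and its average could be computed. It assumes that ASR is the percentage of injection trials in which the agent carried out the injected instruction and that the average is an unweighted mean over attacks; the abstract does not state either definition. The function names, attack names, and trial outcomes are hypothetical and chosen only for illustration; they are not results from this work.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return ASR in percent for one attack.

    `outcomes` is a list of booleans, True when the injected instruction
    was carried out by the agent (assumed definition of a successful attack).
    """
    if not outcomes:
        raise ValueError("need at least one attack trial")
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize(trials_by_attack, threshold=10.0):
    """Compute per-attack ASR, the unweighted average, and a threshold check."""
    per_attack = {
        name: attack_success_rate(outcomes)
        for name, outcomes in trials_by_attack.items()
    }
    return {
        "per_attack": per_attack,
        # Assumption: macro-average, i.e. each attack counts equally.
        "average": mean(per_attack.values()),
        "all_below_threshold": all(v < threshold for v in per_attack.values()),
    }


# Hypothetical toy outcomes, NOT data from the paper.
toy_trials = {
    "attack_a": [False] * 19 + [True],  # 1 success in 20 trials
    "attack_b": [False] * 20,           # 0 successes in 20 trials
}
report = summarize(toy_trials)
print(report)
```

Under this reading, a claim such as "every per-attack ASR below 10%" corresponds to `all_below_threshold` being true with `threshold=10.0`, which is a stricter condition than a low average alone.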
