AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility across all four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and tau-Bench, while achieving a state-of-the-art security--utility tradeoff compared to prior results. Specifically, we employ two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, this defense makes minimal assumptions about the agent and can be deployed out of the box. This makes it highly generalizable while maintaining strong performance without compromising utility. Our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering progress. To address this, we present targeted fixes to these issues for AgentDojo and Agent Security Bench, and propose best practices for more robust benchmark design. Moreover, we introduce a three-stage attack strategy that cascades standard prompt injection attacks, second-order attacks, and adaptive attacks to evaluate the robustness beyond existing attacks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.
翻译:AI智能体易受间接提示注入攻击的影响,此类攻击通过将恶意指令嵌入外部内容或工具输出中,引发非预期或有害行为。受防火墙成熟概念的启发,我们证明了一种部署在智能体-工具接口、简单模块化且与模型无关的防御方法,能在所有四个公开基准(AgentDojo、Agent Security Bench、InjecAgent 和 tau-Bench)上实现完美安全性并保持高实用性,相比先前成果取得了安全性-实用性的最优权衡。具体而言,我们采用两种防火墙:工具输入防火墙(最小化器)与工具输出防火墙(净化器)。与先前复杂方法不同,该防御对智能体假设极少,可开箱即用,从而在保持强性能的同时实现高度泛化性,且不牺牲实用性。我们的分析还揭示了现有基准的关键缺陷,包括有缺陷的成功指标、实现漏洞,以及最重要的一点——攻击能力较弱,这些阻碍了研究进展。为此,我们针对 AgentDojo 和 Agent Security Bench 提出了针对性修复方案,并给出更稳健基准设计的最佳实践。此外,我们提出一种三阶段攻击策略,级联标准提示注入攻击、二阶攻击和自适应攻击,以评估超越现有攻击的鲁棒性。整体而言,本研究表明现有智能体安全基准易被简单方法饱和,凸显了设计更强基准(包含精心选择的评估指标和强自适应攻击)的紧迫性。