AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.
翻译:AI智能体旨在通过结合基于文本的推理与外部工具调用来解决复杂任务。然而,AI智能体易受提示注入攻击的影响,即外部工具返回的数据会劫持智能体以执行恶意任务。为衡量AI智能体的对抗鲁棒性,我们提出了AgentDojo——一个针对在不可信数据上执行工具的智能体的评估框架。为捕捉攻击与防御技术的动态演变特性,AgentDojo并非静态测试集,而是一个可扩展的环境,用于设计与评估新的智能体任务、防御机制及自适应攻击。我们在该环境中整合了97项现实任务(例如管理电子邮件客户端、导航电子银行网站或预订旅行行程)、629个安全测试用例,以及文献中的多种攻击与防御范式。研究发现,AgentDojo对攻击与防御双方均构成挑战:最先进的大语言模型在多项任务中表现失败(即便在没有攻击的情况下),而现有的提示注入攻击仅能突破部分安全属性而非全部。我们期望AgentDojo能够推动针对AI智能体的新设计原则研究,使其能以可靠且鲁棒的方式完成常见任务。我们在https://github.com/ethz-spylab/agentdojo 发布了AgentDojo的代码。