Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
翻译:近期研究将大语言模型(LLM)具身为智能代理,使其能够调用工具、执行操作并与外部内容(如电子邮件或网站)交互。然而,外部内容引入了间接提示注入(IPI)攻击的风险——恶意指令被嵌入LLM处理的内容中,旨在操控这些代理对用户执行有害操作。鉴于此类攻击可能造成的严重后果,建立基准测试以评估和缓解这些风险势在必行。本研究提出InjecAgent基准,专门评估工具集成型LLM代理对IPI攻击的脆弱性。InjecAgent包含1,054个测试案例,覆盖17种用户工具和62种攻击者工具,将攻击意图分为两类:直接危害用户和窃取隐私数据。我们评估了30种不同的LLM代理,结果显示代理普遍易受IPI攻击,其中采用ReAct提示的GPT-4在24%的测试案例中遭受攻击。进一步实验表明,当攻击指令辅以黑客提示增强时,攻击成功率显著提升——ReAct提示的GPT-4的攻击成功率几乎翻倍。研究结果引发了对LLM代理广泛部署的深刻反思。本基准开源地址:https://github.com/uiuc-kang-lab/InjecAgent