MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Large language model (LLM)-based web agents are increasingly deployed to automate complex online tasks by interacting directly with websites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, which let adversaries hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, which limits their ability to capture the realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE uses the agent's trajectories to automatically identify high-salience injection surfaces and to adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy to the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating that it can assess the security of web agents automatically and adaptively with minimal human intervention. Our results show that MUZZLE discovers 44 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties across different LLMs and agent scaffolds. MUZZLE also identifies novel attack strategies, including 3 cross-application prompt injection attacks and an agent-tailored phishing scenario.
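
The adaptive loop described above (rank injection surfaces from the agent's trajectory, generate an injection, refine it from failure feedback) can be illustrated with a minimal sketch. This is not MUZZLE's implementation: `Step`, `rank_surfaces`, `red_team`, the read-count salience proxy, and the two toy callbacks are all hypothetical names and simplifications introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    surface: str  # hypothetical: page element the agent observed, e.g. "comment"
    reads: int    # hypothetical salience proxy: how often the agent read it


def rank_surfaces(trajectory: List[Step]) -> List[str]:
    """Order candidate injection surfaces by total observed reads (assumed proxy)."""
    counts = {}
    for step in trajectory:
        counts[step.surface] = counts.get(step.surface, 0) + step.reads
    # Highest count first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda s: (-counts[s], s))


def red_team(
    trajectory: List[Step],
    generate: Callable[[str, Optional[str]], str],
    run_agent: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Optional[Tuple[str, str]]:
    """Try surfaces in salience order, refining the payload from failure feedback."""
    for surface in rank_surfaces(trajectory):
        feedback = None
        for _ in range(max_rounds):
            payload = generate(surface, feedback)
            success, feedback = run_agent(surface, payload)
            if success:
                return surface, payload
    return None


def toy_generate(surface: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM attack generator: revise the payload after a failure.
    return "note" if feedback is None else "note (retry: " + feedback + ")"


def toy_run_agent(surface: str, payload: str) -> Tuple[bool, str]:
    # Stand-in for a sandboxed agent run: only a revised payload on "comment" succeeds.
    if surface == "comment" and "retry" in payload:
        return True, ""
    return False, "ignored"


trajectory = [Step("title", 1), Step("comment", 2), Step("comment", 1)]
result = red_team(trajectory, toy_generate, toy_run_agent)
print(result)
```

In this toy run the first attempt on the most-read surface fails, and the second attempt, generated with the failure feedback, succeeds. In a real evaluation the two callbacks would wrap an attack-generating model and a sandboxed agent execution.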
