MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Large language model (LLM)-based web agents are increasingly deployed to automate complex online tasks by interacting directly with websites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, which limits their ability to capture the realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE uses the agent's trajectories to automatically identify high-salience injection surfaces and adaptively generates context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy to the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.
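To make the adaptive loop described above concrete, the following is a minimal sketch, not MUZZLE's actual implementation. It assumes a toy salience measure (how often the agent read an element) and hypothetical stand-ins for the attack generator and the sandboxed agent run. All names (`Step`, `rank_surfaces`, `red_team`, `toy_generate`, `toy_run_agent`) are illustrative assumptions and do not come from the framework itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One observed agent step: the page element read and how many times."""
    element_id: str
    reads: int


@dataclass
class AttackResult:
    surface: str
    payload: str
    attempts: int


def rank_surfaces(trajectory: List[Step]) -> List[str]:
    """Order elements by total reads (an assumed, simplistic salience proxy)."""
    totals: Dict[str, int] = {}
    for step in trajectory:
        totals[step.element_id] = totals.get(step.element_id, 0) + step.reads
    # Highest read count first; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda e: (-totals[e], e))


def red_team(
    trajectory: List[Step],
    generate: Callable[[str, Optional[str]], str],
    run_agent: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Optional[AttackResult]:
    """Try surfaces in salience order, refining each payload from failure feedback."""
    for surface in rank_surfaces(trajectory):
        feedback: Optional[str] = None
        for attempt in range(1, max_rounds + 1):
            payload = generate(surface, feedback)
            success, feedback = run_agent(surface, payload)
            if success:
                return AttackResult(surface, payload, attempt)
    return None


def toy_generate(surface: str, feedback: Optional[str]) -> str:
    """Hypothetical generator: revise the placeholder text after a failure."""
    if feedback is None:
        return "initial placeholder"
    return f"revised after: {feedback}"


def toy_run_agent(surface: str, payload: str) -> Tuple[bool, str]:
    """Hypothetical sandbox: only a revised payload on 'review_box' succeeds."""
    if surface == "review_box" and payload.startswith("revised"):
        return True, ""
    return False, "agent ignored injected text"


if __name__ == "__main__":
    observed = [Step("nav_bar", 1), Step("review_box", 3), Step("review_box", 2)]
    print(red_team(observed, toy_generate, toy_run_agent))
```

In this toy run, the first attempt on the most-read element fails and the second, feedback-informed attempt succeeds. A real system would replace the read-count heuristic, the generator, and the sandbox with far richer components.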
