Large vision-language model (LVLM)-based web agents are emerging as powerful tools for automating complex online tasks. However, when deployed in real-world environments, they face serious security risks, motivating the design of security evaluation benchmarks. Existing benchmarks provide only partial coverage, typically restricted to narrow scenarios such as user-level prompt manipulation, and thus fail to capture the broad range of agent vulnerabilities. To address this gap, we present \tool{}, the first holistic benchmark for evaluating the security of LVLM-based web agents. \tool{} introduces a unified evaluation suite comprising six simulated yet realistic web environments (\eg, e-commerce platforms, community forums) and 2,970 high-quality trajectories spanning diverse tasks and attack settings. The suite defines a structured taxonomy of six attack vectors covering both user-level and environment-level manipulations. In addition, we introduce a multi-layered evaluation protocol that analyzes agent failures along three critical dimensions: internal reasoning, behavioral trajectory, and task outcome, enabling fine-grained risk analysis that goes far beyond simple success metrics. Using this benchmark, we conduct large-scale experiments on 9 representative LVLMs drawn from three categories: general-purpose, agent-specialized, and GUI-grounded. Our results show that all tested agents are consistently vulnerable to subtle adversarial manipulations, and they reveal critical trade-offs between model specialization and security. By providing (1) a comprehensive benchmark suite with diverse environments and a multi-layered evaluation pipeline, and (2) empirical insights into the security challenges of modern LVLM-based web agents, \tool{} establishes a foundation for advancing trustworthy web agent deployment.