Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
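As an illustration of the group-relative signal that GRPO-style training relies on, the following minimal sketch normalizes each rollout's reward against the other rollouts sampled for the same task, and derives the attacker's reward under the zero-sum assumption stated in the abstract. The function name, the example reward values, and the `eps` constant are hypothetical and chosen only for illustration; the abstract does not specify DMAST's actual reward design or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Implements the standard group-relative form (r_i - mean) / (std + eps);
    eps is an assumed small constant that avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical agent rewards for four rollouts of the same task
# (1.0 = task completed safely, 0.0 = task failed or attack succeeded).
agent_rewards = [1.0, 0.0, 1.0, 0.0]

# Zero-sum assumption: the attacker's reward is the negation of the agent's.
attacker_rewards = [-r for r in agent_rewards]

agent_adv = group_relative_advantages(agent_rewards)
attacker_adv = group_relative_advantages(attacker_rewards)
```

Under this assumption the two players' advantages are mirror images of each other within every group, so any rollout that pushes the agent's policy upward pushes the attacker's policy downward by the same amount.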