As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation functions under a realistic threat model on top of VisualWebArena, a real environment for web-based agents. To systematically examine the robustness of various multimodal web agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness into the flow of adversarial information on the graph. First, we find that we can successfully break a range of the latest agents that use black-box frontier LLMs, including those that perform reflection and tree search. With imperceptible perturbations to a single product image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates of up to 67%. We also use ARE to rigorously evaluate how robustness changes as new components are added. We find that new components that typically improve benign performance can open up new vulnerabilities and harm robustness: an attacker can compromise the evaluator used by the reflexion agent and the value function of the tree-search agent, increasing the attack success rate by a relative 15% and 20%, respectively. Our data and code for attacks, defenses, and evaluation are available at https://github.com/ChenWu98/agent-attack.
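The graph view at the heart of ARE can be illustrated with a minimal sketch: model the agent as a directed graph of components and trace which downstream components a compromised input can reach. The component names (`captioner`, `value_function`, etc.) and the reachability traversal below are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the ARE graph abstraction: components are nodes,
# intermediate outputs are edges, and adversarial influence is modeled as
# reachability from a compromised node. Names are illustrative assumptions.
from collections import defaultdict, deque


class AgentGraph:
    def __init__(self):
        # component -> list of components that consume its output
        self.edges = defaultdict(list)

    def add_flow(self, src, dst):
        self.edges[src].append(dst)

    def reachable_from(self, compromised):
        """BFS over output edges: every component that adversarial
        information originating at `compromised` can flow into."""
        seen, queue = set(), deque([compromised])
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Example: a tree-search agent whose captioner and value function both
# read the page screenshot, so a perturbed product image reaches the policy
# through two paths (hypothetical component layout).
g = AgentGraph()
g.add_flow("webpage_image", "captioner")
g.add_flow("webpage_image", "value_function")
g.add_flow("captioner", "policy_lm")
g.add_flow("value_function", "policy_lm")
g.add_flow("policy_lm", "action")

print(sorted(g.reachable_from("webpage_image")))
```

Under this abstraction, adding a component such as a value function adds new edges from attacker-controllable inputs to the policy, which is one way to see how a component that improves benign performance can also widen the attack surface.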