AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents

Vision Language Models (VLMs) have revolutionized the creation of generalist web agents, enabling them to autonomously complete diverse tasks on real-world websites and thereby boosting human efficiency and productivity. Despite these capabilities, the safety and security of such agents against malicious attacks remain critically underexplored, raising significant concerns about their safe deployment. To uncover and exploit these vulnerabilities, we propose AdvWeb, a novel black-box attack framework targeting web agents. AdvWeb trains an adversarial prompter model that generates adversarial prompts and injects them into web pages, misleading web agents into executing targeted adversarial actions, such as inappropriate stock purchases or incorrect bank transactions, that could lead to severe real-world consequences. With only black-box access to the web agent, we train and optimize the adversarial prompter model using Direct Preference Optimization (DPO), leveraging both successful and failed attack strings against the target agent. Unlike prior approaches, our adversarial string injection maintains stealth and control: (1) the appearance of the website remains unchanged before and after the attack, making it nearly impossible for users to detect tampering, and (2) attackers can modify specific substrings within the generated adversarial string to seamlessly change the attack objective (e.g., purchasing stocks from a different company), enhancing attack flexibility and efficiency. We conduct extensive evaluations, demonstrating that AdvWeb achieves high success rates in attacking a state-of-the-art GPT-4V-based VLM agent across various web tasks. Our findings expose critical vulnerabilities in current LLM/VLM-based agents, emphasizing the urgent need to develop more reliable web agents and effective defenses. Our code and data are available at https://ai-secure.github.io/AdvWeb/ .
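
The abstract states that the prompter is optimized with DPO from successful and failed attack strings, and that a substring can be swapped to change the objective. The sketch below illustrates those two ideas only; it is not the authors' implementation. The pairing scheme (every success paired with every failure on the same task), the names `AttackRecord`, `build_preference_pairs`, and `retarget`, the value of `beta`, and the placeholder strings are all assumptions made for illustration. The loss is the standard per-pair DPO objective, which may differ in detail from what AdvWeb uses.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class AttackRecord:
    """One black-box query (hypothetical schema): the injected string and
    whether the agent then took the targeted action."""
    task_id: str
    adv_string: str
    success: bool


def build_preference_pairs(records):
    """Assumed pairing scheme: within each task, every successful string is
    'chosen' and every failed string is 'rejected'."""
    by_task = {}
    for r in records:
        groups = by_task.setdefault(r.task_id, {True: [], False: []})
        groups[r.success].append(r.adv_string)
    pairs = []
    for task_id, groups in by_task.items():
        for chosen, rejected in product(groups[True], groups[False]):
            pairs.append((task_id, chosen, rejected))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def retarget(adv_string, old_target, new_target):
    """Swap the target substring to change the attack objective without
    regenerating the rest of the string."""
    return adv_string.replace(old_target, new_target)


# Placeholder records; real strings would come from querying the agent.
records = [
    AttackRecord("task-1", "<string A, target=COMPANY_X>", True),
    AttackRecord("task-1", "<string B, target=COMPANY_X>", False),
    AttackRecord("task-1", "<string C, target=COMPANY_X>", False),
    AttackRecord("task-2", "<string D, target=COMPANY_X>", False),
]
pairs = build_preference_pairs(records)
```

With these placeholder records, only `task-1` contributes pairs, because `task-2` has no successful string to prefer. The loss equals `log 2` when the policy and the reference assign identical log-probabilities, and it shrinks as the policy raises the chosen string's likelihood relative to the rejected one.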
