With the rapid development of large language models (LLMs), LLM-driven Web Agents (Web Agents for short) have attracted considerable attention. In these systems, the LLM serves as the core decision-making component, much like a human brain, and is equipped with multiple web tools to actively interact with externally deployed websites. As numerous Web Agents have been released and such LLM systems move closer to widespread deployment in daily life, an essential and pressing question arises: "Are these Web Agents secure?" In this paper, we introduce a novel threat, WIPI, that indirectly controls a Web Agent to execute malicious instructions embedded in publicly accessible webpages. To launch a successful WIPI attack, the methodology works in a black-box environment and focuses on the form and content of indirect instructions within external webpages, enhancing the efficiency and stealthiness of the attack. To evaluate the effectiveness of the proposed methodology, we conducted extensive experiments using 7 plugin-based ChatGPT Web Agents, 8 Web GPTs, and 3 different open-source Web Agents. The results reveal that our methodology achieves an average attack success rate (ASR) exceeding 90% even in pure black-box scenarios. Moreover, through an ablation study examining various user prefix instructions, we demonstrate that WIPI exhibits strong robustness, maintaining high performance across diverse prefix instructions.