For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in chat settings transfer effectively to browser agents. With human rewrites, GPT-4o- and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.