AI Kill Switch for malicious web-based LLM agent

Recently, web-based Large Language Model (LLM) agents autonomously perform increasingly complex tasks, thereby bringing significant convenience. However, they also amplify the risks of malicious misuse cases such as unauthorized collection of personally identifiable information (PII), generation of socially divisive content, and even automated web hacking. To address these threats, we propose an AI Kill Switch technique that can immediately halt the operation of malicious web-based LLM agents. To achieve this, we introduce AutoGuard - the key idea is generating defensive prompts that trigger the safety mechanisms of malicious LLM agents. In particular, generated defense prompts are transparently embedded into the website's DOM so that they remain invisible to human users but can be detected by the crawling process of malicious agents, triggering its internal safety mechanisms to abort malicious actions once read. To evaluate our approach, we constructed a dedicated benchmark consisting of three representative malicious scenarios. Experimental results show that AutoGuard achieves over 80% Defense Success Rate (DSR) across diverse malicious agents, including GPT-4o, Claude-4.5-Sonnet and generalizes well to advanced models like GPT-5.1, Gemini-2.5-flash, and Gemini-3-pro. Also, our approach demonstrates robust defense performance in real-world website environments without significant performance degradation for benign agents. Through this research, we demonstrate the controllability of web-based LLM agents, thereby contributing to the broader effort of AI control and safety.

翻译：近年来，基于网络的大型语言模型（LLM）代理能够自主执行日益复杂的任务，从而带来显著便利。然而，它们也放大了恶意滥用行为的风险，例如未经授权收集个人可识别信息（PII）、生成引发社会对立的内容，甚至实施自动化的网络攻击。为应对这些威胁，我们提出一种AI紧急制动技术，能够立即中止恶意网络LLM代理的运行。为实现这一目标，我们引入了AutoGuard——其核心思想是生成能够触发恶意LLM代理安全机制的防御性提示。具体而言，生成的防御提示被透明地嵌入网站的文档对象模型（DOM）中，使其对人类用户不可见，但可被恶意代理的爬取过程检测到。一旦读取这些提示，即可触发代理内部的安全机制，从而中止恶意行为。为评估我们的方法，我们构建了一个包含三种代表性恶意场景的专用基准测试。实验结果表明，AutoGuard在包括GPT-4o、Claude-4.5-Sonnet在内的多种恶意代理上实现了超过80%的防御成功率（DSR），并能良好地泛化至GPT-5.1、Gemini-2.5-flash和Gemini-3-pro等先进模型。同时，我们的方法在真实网站环境中展现出稳健的防御性能，且对良性代理的性能无明显影响。通过本研究，我们论证了网络LLM代理的可控性，从而为更广泛的AI控制与安全研究作出贡献。

相关内容

关注 7109

人工智能杂志AI(Artificial Intelligence)是目前公认的发表该领域最新研究成果的主要国际论坛。该期刊欢迎有关AI广泛方面的论文，这些论文构成了整个领域的进步，也欢迎介绍人工智能应用的论文，但重点应该放在新的和新颖的人工智能方法如何提高应用领域的性能，而不是介绍传统人工智能方法的另一个应用。关于应用的论文应该描述一个原则性的解决方案，强调其新颖性，并对正在开发的人工智能技术进行深入的评估。官网地址：http://dblp.uni-trier.de/db/journals/ai/

人机协同作战规划：来自美海军陆战队的大语言模型（LLM）使用教训

专知会员服务

26+阅读 · 2025年10月16日

大型语言模型（LLM）智能体全栈安全的综述：数据、训练与部署

专知会员服务

33+阅读 · 2025年4月23日

可信赖LLM智能体的研究综述：威胁与应对措施

专知会员服务

36+阅读 · 2025年3月17日

AI在医疗中的安全挑战

专知会员服务

19+阅读 · 2024年10月5日