Recently, web-based Large Language Model (LLM) agents autonomously perform increasingly complex tasks, thereby bringing significant convenience. However, they also amplify the risks of malicious misuse cases such as unauthorized collection of personally identifiable information (PII), generation of socially divisive content, and even automated web hacking. To address these threats, we propose an AI Kill Switch technique that can immediately halt the operation of malicious web-based LLM agents. To achieve this, we introduce AutoGuard - the key idea is generating defensive prompts that trigger the safety mechanisms of malicious LLM agents. In particular, generated defense prompts are transparently embedded into the website's DOM so that they remain invisible to human users but can be detected by the crawling process of malicious agents, triggering its internal safety mechanisms to abort malicious actions once read. To evaluate our approach, we constructed a dedicated benchmark consisting of three representative malicious scenarios. Experimental results show that AutoGuard achieves over 80% Defense Success Rate (DSR) across diverse malicious agents, including GPT-4o, Claude-4.5-Sonnet and generalizes well to advanced models like GPT-5.1, Gemini-2.5-flash, and Gemini-3-pro. Also, our approach demonstrates robust defense performance in real-world website environments without significant performance degradation for benign agents. Through this research, we demonstrate the controllability of web-based LLM agents, thereby contributing to the broader effort of AI control and safety.
翻译:近年来,基于网络的大型语言模型(LLM)代理能够自主执行日益复杂的任务,从而带来显著便利。然而,它们也放大了恶意滥用行为的风险,例如未经授权收集个人可识别信息(PII)、生成引发社会对立的内容,甚至实施自动化的网络攻击。为应对这些威胁,我们提出一种AI紧急制动技术,能够立即中止恶意网络LLM代理的运行。为实现这一目标,我们引入了AutoGuard——其核心思想是生成能够触发恶意LLM代理安全机制的防御性提示。具体而言,生成的防御提示被透明地嵌入网站的文档对象模型(DOM)中,使其对人类用户不可见,但可被恶意代理的爬取过程检测到。一旦读取这些提示,即可触发代理内部的安全机制,从而中止恶意行为。为评估我们的方法,我们构建了一个包含三种代表性恶意场景的专用基准测试。实验结果表明,AutoGuard在包括GPT-4o、Claude-4.5-Sonnet在内的多种恶意代理上实现了超过80%的防御成功率(DSR),并能良好地泛化至GPT-5.1、Gemini-2.5-flash和Gemini-3-pro等先进模型。同时,我们的方法在真实网站环境中展现出稳健的防御性能,且对良性代理的性能无明显影响。通过本研究,我们论证了网络LLM代理的可控性,从而为更广泛的AI控制与安全研究作出贡献。