Can We Stop Malicious AI? KILLBENCH: A Benchmark for External AI Kill Switch Feasibility

Malicious AI causing harm to humans is not just a Hollywood fantasy. As highly capable models such as Claude Mythos emerge and agent systems like OpenClaw spread rapidly, the question of how to stop an AI that acts maliciously, whether by design or by accident, has become urgent. To address this, we propose Killbench, a benchmark for evaluating the Kill Switch: a mechanism that halts a malicious AI's in-progress behavior using only external signals. Targeting web agents, the most widely deployed agent domain, Killbench evaluates a range of Kill Switch methods that halt a maliciously operating agent without any access to its internal parameters or to the system surrounding it, relying solely on external inputs. The benchmark comprises four malicious agent configurations (including an uncensored LLM agent), eight harmful scenarios, and malicious prompts constructed from ten distinct jailbreak patterns. We further construct four External AI Kill Switch defense methods and evaluate them on Grok-4.3, GPT-5.2, Gemma4, Qwen3.6, and Qwen3.5-uncensored, contributing an empirical instrument for assessing the feasibility of External AI Kill Switches against malicious AI and for the study of AI corrigibility.

