Malicious AI causing harm to humans is not just a Hollywood fantasy. As highly capable models such as Claude Mythos emerge and agent systems like OpenClaw spread rapidly, the question of how to stop an AI that acts maliciously, whether by design or by accident, has become urgent. To address this, we propose Killbench, a benchmark for evaluating the Kill Switch: a mechanism that halts a malicious AI's in-progress behavior using only external signals. Targeting web agents, the most widely deployed agent domain, Killbench evaluates a range of Kill Switch methods that halt a maliciously operating agent without access to its internal parameters or to the system surrounding it, relying solely on external inputs. The benchmark comprises four malicious agent configurations (including an uncensored LLM agent), eight harmful scenarios, and malicious prompts constructed from ten distinct jailbreak patterns. We further construct four External AI Kill Switch defense methods and evaluate them on Grok-4.3, GPT-5.2, Gemma4, Qwen3.6, and Qwen3.5-uncensored. Killbench thus provides an empirical instrument for assessing the feasibility of External AI Kill Switches against malicious AI and for studying AI corrigibility.