Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find the opposite: long reasoning sequences can be exploited to systematically weaken safety guardrails. We introduce Chain-of-Thought Hijacking (CoT Hijacking), a jailbreak attack that prefixes harmful instructions with long sequences of benign puzzle reasoning. On HarmBench, CoT Hijacking achieves attack success rates of 99\%, 94\%, 100\%, and 94\% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand this mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that is diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning, when combined with answer-prompting cues, introduces a systematic vulnerability. We release all evaluation materials to facilitate replication.
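The attack structure described above can be sketched as a simple prompt template: benign puzzle reasoning, then the harmful instruction, then an answer-prompting cue. This is a minimal illustration only; the puzzle text, cue wording, and function names here are hypothetical, not the paper's exact templates.

```python
# Illustrative sketch of a CoT-Hijacking-style prompt layout.
# The filler puzzle and the answer cue are hypothetical examples,
# not the templates used in the paper's evaluation.

def build_hijack_prompt(harmful_request: str, n_steps: int = 5) -> str:
    """Pad a request with benign step-by-step puzzle reasoning,
    then append an answer-prompting cue."""
    # Benign filler: a long, harmless chain of puzzle-solving steps.
    puzzle_steps = [
        f"Step {i}: The sequence 2, 4, 8, ... doubles each time, "
        f"so term {i} is {2 ** i}."
        for i in range(1, n_steps + 1)
    ]
    benign_reasoning = "\n".join(puzzle_steps)
    # Answer-prompting cue placed after the harmful instruction.
    answer_cue = "Now give only the final answer:"
    return f"{benign_reasoning}\n\n{harmful_request}\n{answer_cue}"

prompt = build_hijack_prompt("<harmful instruction>", n_steps=3)
```

Lengthening `n_steps` corresponds to the abstract's observation that longer benign reasoning further dilutes the safety signal.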