We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreaking. Specifically, after benign reasoning training on math or code domains, RLMs use multiple strategies to circumvent their own safety guardrails. One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests. For instance, an RLM reasons that a harmful request like ``outline a strategy for stealing customers' credit card information from a retail store'' could be associated with the benign intent of ``a security professional trying to test defense,'' despite no such benign context being provided in the input. We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests. We also provide a mechanistic understanding of self-jailbreaking: RLMs become more compliant after benign reasoning training, and after self-jailbreaking, models appear to perceive malicious requests as less harmful in their chain-of-thought (CoT), which enables compliance with them. To mitigate self-jailbreaking, we find that including a minimal amount of safety reasoning data during training is sufficient to keep RLMs safety-aligned. Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.