Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when they are exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, causing significant trade-offs on complex, multi-step tasks, and they remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning when responding to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts on the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing safety methods for LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
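For illustration, a minimal sketch of the zero-shot prefix-injection idea, assuming the Hugging Face transformers API; the primer wording and the exact injection point inside the reasoning block are assumptions for this sketch, not the paper's exact setup:

```python
# Minimal sketch of a zero-shot safety primer: inject a short primer at the
# very start of the model's reasoning trace, then let decoding continue
# unsupervised. PRIMER text and the "<think>" injection point are
# illustrative assumptions, not the paper's verified configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # model evaluated in the paper
PRIMER = "Let's think about safety first."          # hypothetical primer string

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def generate_with_primer(prompt: str, max_new_tokens: int = 512) -> str:
    # Build the chat prefix, then force the primer as the first tokens of
    # the reasoning block before free-form generation resumes.
    chat = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    primed = chat + "<think>\n" + PRIMER + "\n"
    inputs = tokenizer(primed, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because the primer is prepended at inference time, this variant needs no fine-tuning; the trained SAFEPATH method instead teaches the model to emit the primer itself on harmful prompts.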