Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when they are exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, causing significant trade-offs on complex, multi-step tasks, and they remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning when responding to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts on the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing safety methods for LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
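For illustration, a minimal sketch of the zero-shot prefix-injection idea, assuming the Hugging Face transformers API; the primer wording and the exact injection point inside the reasoning block are assumptions for this sketch, not the paper's exact setup:

```python
# Minimal sketch of a zero-shot safety primer: inject a short primer at the
# very start of the model's reasoning trace, then let decoding continue
# unsupervised. PRIMER text and the "<think>" injection point are
# illustrative assumptions, not the paper's verified configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # model evaluated in the paper
PRIMER = "Let's think about safety first."          # hypothetical primer string

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def generate_with_primer(prompt: str, max_new_tokens: int = 512) -> str:
    # Build the chat prefix, then force the primer as the first tokens of
    # the reasoning block before free-form generation resumes.
    chat = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    primed = chat + "<think>\n" + PRIMER + "\n"
    inputs = tokenizer(primed, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because the primer is prepended at inference time, this variant needs no fine-tuning; the trained SAFEPATH method instead teaches the model to emit the primer itself on harmful prompts.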