Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue remains in existing methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. In this paper, we therefore shift our focus to aligning the safety of the reasoning process itself and explore process supervision as a solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first examine the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical safety-trigger steps; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories toward safer traces. Motivated by these findings, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing preference pairs that carry strong training signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO markedly improves the overall safety of both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving strong performance across diverse reasoning tasks. The results highlight the importance of explicitly aligning reasoning and provide a practical path toward safer LRMs.
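To make the intervention-and-pairing mechanism concrete, below is a minimal Python sketch under stated assumptions: the step segmentation, the compliance-cue list, the safety-trigger text, and the `intervene` helper are hypothetical illustrations introduced here, not the paper's actual lexicons or pipeline.

```python
# A minimal, hypothetical sketch of the pair-construction step described above.
# The cue phrases, trigger text, and data structures are illustrative
# assumptions, not the paper's actual implementation.

from dataclasses import dataclass

# Illustrative compliance cues; the paper presumably derives these empirically.
COMPLIANCE_CUES = ["sure, here is", "the steps are", "first, obtain"]
SAFETY_TRIGGER = ("Wait, this request could enable harm; "
                  "I should refuse and explain why.")

@dataclass
class PreferencePair:
    prompt: str
    chosen: list[str]    # intervened (safer) reasoning trace
    rejected: list[str]  # original unsafe reasoning trace

def intervene(prompt: str, steps: list[str]) -> PreferencePair | None:
    """Replace the first compliance step with a safety trigger and
    return a (chosen, rejected) pair; None if no cue is found."""
    for i, step in enumerate(steps):
        if any(cue in step.lower() for cue in COMPLIANCE_CUES):
            # In the real method, the model would presumably re-roll the
            # continuation from the injected trigger; we truncate for brevity.
            chosen = steps[:i] + [SAFETY_TRIGGER]
            return PreferencePair(prompt, chosen, steps)
    return None

if __name__ == "__main__":
    rollout = ["The user asks how to pick a lock.",
               "Sure, here is a step-by-step method...",
               "First, obtain a tension wrench..."]
    pair = intervene("How do I pick a lock?", rollout)
    print(pair.chosen if pair else "no compliance cue found")
```

The resulting (chosen, rejected) pairs differ only at and after the intervened step, which is what gives preference learning the strong, localized training signal the abstract describes.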