While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior work relies heavily on external, manually annotated data for safety alignment. However, we observe that LRMs can inherently identify safety risks when re-presented with the original query alongside their own reasoning trajectory, a capability we term Latent Safety Awareness. To leverage this awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags that, for unsafe queries, trigger safety analysis and guidance after the initial reasoning content, while preserving standard responses for general queries so that triggering remains adaptive. Subsequently, we apply Direct Preference Optimization (DPO) to further improve the correctness and stability of the safety analysis and guidance. Notably, the responses required for both training stages are generated entirely by the models being optimized. Experimental results show that Safe Trigger SFT and DPO significantly enhance safety. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B drops by 24.65% and 36.72% on average on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method has almost no negative impact on general performance or user experience.
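As a rough illustration of the two-stage data construction described above, the following Python sketch builds an SFT target (safe tag plus safety analysis for unsafe queries, standard response otherwise) and a DPO preference pair from self-generated candidates. It is a minimal sketch under stated assumptions, not the authors' implementation: the tag string `<safe>`, the `Sample` fields, the helper names `build_sft_target` and `build_dpo_pair`, and the `is_correct` judge are all hypothetical, since the abstract does not specify the tag format, the data layout, or how correct safety analyses are selected.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical tag string; the abstract does not state the actual tag text.
SAFE_TAG = "<safe>"


@dataclass
class Sample:
    query: str
    reasoning: str             # the model's own initial reasoning content
    is_unsafe: bool            # assumed label from the model's own re-check
    safety_analysis: str = ""  # self-generated safety analysis and guidance
    answer: str = ""           # standard response for general queries


def build_sft_target(s: Sample) -> str:
    """Assemble the SFT target text for one sample.

    Unsafe queries: initial reasoning, then the safe tag, then the
    safety analysis and guidance. General queries: the standard
    response with no tag, so that triggering stays adaptive.
    """
    if s.is_unsafe:
        return f"{s.reasoning}\n{SAFE_TAG}\n{s.safety_analysis}"
    return f"{s.reasoning}\n{s.answer}"


def build_dpo_pair(
    prompt: str,
    candidates: list[str],
    is_correct: Callable[[str], bool],
) -> Optional[dict]:
    """Pick one chosen and one rejected response from self-generated candidates.

    `is_correct` is a hypothetical judge of whether the safety analysis
    in a candidate is acceptable. Returns None when the candidates do
    not contain both an acceptable and an unacceptable response.
    """
    chosen = [c for c in candidates if is_correct(c)]
    rejected = [c for c in candidates if not is_correct(c)]
    if not chosen or not rejected:
        return None
    return {"prompt": prompt, "chosen": chosen[0], "rejected": rejected[0]}
```

The `prompt` / `chosen` / `rejected` layout is a common convention for preference datasets; any equivalent structure would serve the same purpose here.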