Large reasoning models (LRMs) achieve remarkable performance by applying reinforcement learning (RL) to reasoning tasks, generating long chain-of-thought (CoT) traces. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain the latent knowledge needed to identify harm. ThinkSafe unlocks this knowledge via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show that ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning relative to GRPO, at significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
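The refusal steering mentioned above can be illustrated with a minimal, hypothetical sketch of activation steering: a "refusal direction" is estimated as the difference of mean hidden activations between refusal-eliciting and compliant prompts, and the residual stream is shifted along it at generation time. All names (`refusal_direction`, `steer`, `alpha`) and the toy activations are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def refusal_direction(refusal_acts: np.ndarray, compliant_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm steering vector as the difference of mean activations."""
    d = refusal_acts.mean(axis=0) - compliant_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift a hidden state along the refusal direction by a scalar strength alpha."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 8
# Toy activations standing in for a transformer layer's residual stream
# on refusal-eliciting vs. compliant prompts.
refusal_acts = rng.normal(1.0, 0.1, size=(16, dim))
compliant_acts = rng.normal(-1.0, 0.1, size=(16, dim))

d = refusal_direction(refusal_acts, compliant_acts)
h = rng.normal(size=dim)
h_steered = steer(h, d, alpha=4.0)

# The steered state has a strictly larger projection onto the refusal direction.
assert h_steered @ d > h @ d
```

In practice such a shift would be applied inside the model (e.g. via a forward hook on a chosen layer) only while sampling the safety reasoning traces used for fine-tuning.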