Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts into more concrete, specific requests can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. The framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, and then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Using this pipeline, we build the PGA dataset, a verified alignment dataset of 3,989 samples. Extensive experiments show that fine-tuning LRMs on the PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rate across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, and even enhances, the model's general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.