Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet they face safety issues similar to those of standard language models. In particular, while alignment algorithms are designed to guide them to deliberately refuse harmful prompts through safe reasoning, this process often fails to generalize to diverse and complex jailbreak attacks. In this work, we attribute these failures to the limited generalization of the safe reasoning process, in particular its insufficiency against complex attack prompts. We provide both theoretical and empirical evidence for the necessity of a more thorough safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRMs to adaptively identify and address safety risks with appropriate granularity in their thinking content. Extensive experiments demonstrate that RAPO generalizes the safe reasoning of multiple LRMs adaptively across diverse attack prompts while preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.