Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
翻译:推理模型在复杂推理任务中展现出卓越能力。然而,确保其对抗对抗性越狱提示的安全性仍是关键挑战。由于此类提示具有隐蔽性和欺骗性,它们常能规避内置安全机制,导致生成有害内容。这凸显了对自适应安全对齐方法的需求,使模型能够针对对抗性输入自主增强防御能力。本文提出基于合成指南的自适应安全对齐(SGASA)框架,该框架通过内化模型生成的安全指南来增强模型抵御有害对抗性提示的鲁棒性,同时最大限度地减少对良性请求的不必要拒绝。SGASA包含两个关键阶段:数据预合成(生成安全指南与增强提示)和对齐微调(利用监督微调与直接偏好优化将指南嵌入模型)。跨多个数据集的广泛实验表明,SGASA显著提升了模型安全性,验证了其自适应与可扩展的有效性。