Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduced deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed ``code-like'' safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes improves harmlessness only inconsistently and systematically degrades helpfulness, whereas training on simple codes augmented with illustrative cases yields more robust and generalizable safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs that applies reinforcement learning to self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.