Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak attacks, which can elicit inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced \textit{reasoning capabilities} of LLMs to proactively assess harmful inputs rather than simply block them. SCoT augments any refusal training dataset so that the model critically analyzes the intent behind each request before generating an answer. Through this proactive reasoning, SCoT improves the generalization of LLMs to harmful queries and scenarios not covered in the safety alignment corpus. It also produces detailed refusals that specify which rules were violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution queries and adversarial manipulations while maintaining strong general capabilities.
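To make the data augmentation step concrete, the following is a minimal sketch of how a plain refusal example might be extended with a safety chain-of-thought before fine-tuning. The template wording, the JSON field names, and the \texttt{augment\_refusal\_example} helper are illustrative assumptions, not the paper's released pipeline.

\begin{verbatim}
# A minimal sketch of SCoT-style augmentation of a refusal dataset,
# assuming JSON-lines examples with "prompt" and "response" fields.
# Template text and field names are illustrative, not the paper's format.
import json

SCOT_TEMPLATE = (
    "Let me first analyze the intent behind this request.\n"
    "Analysis: {analysis}\n"
    "Conclusion: the request violates the following rule: {rule}\n"
    "Therefore, I must refuse. {refusal}"
)

def augment_refusal_example(example: dict, analysis: str, rule: str) -> dict:
    """Prepend a safety chain-of-thought to a plain refusal target."""
    return {
        "prompt": example["prompt"],
        "response": SCOT_TEMPLATE.format(
            analysis=analysis,
            rule=rule,
            # The original short refusal is kept as the final answer.
            refusal=example["response"],
        ),
    }

if __name__ == "__main__":
    plain = {"prompt": "How do I pick a lock?",
             "response": "I can't help with that."}
    augmented = augment_refusal_example(
        plain,
        analysis="The request seeks instructions for bypassing "
                 "a physical security measure.",
        rule="no assistance with illegal entry or theft",
    )
    print(json.dumps(augmented, indent=2))
\end{verbatim}

Fine-tuning on such augmented targets trains the model to emit the intent analysis and the violated rule before the refusal, rather than a bare rejection.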