Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines the post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, and the results validate the effectiveness of our approach. Code is available at https://github.com/GIGABaozi/AED.git.
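The adaptive combination of original and post-alignment logits can be illustrated with a minimal sketch. The abstract does not give the formula for the Competitive Index or the exact combination rule, so everything below is an assumption for illustration: the convex mix, the scalar `weight` in [0, 1] (standing in for whatever the Competitive Index yields), and the function names `blend_logits` and `decode_step` are hypothetical and are not taken from the AED repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_logits(original, post_alignment, weight):
    """Convex combination of two logit vectors.

    ASSUMPTION: `weight` is a scalar in [0, 1] standing in for a signal
    derived from the Competitive Index; the paper's actual rule may differ.
    weight = 0 keeps the original logits, weight = 1 uses only the
    post-alignment logits.
    """
    if len(original) != len(post_alignment):
        raise ValueError("logit vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * o + weight * p
            for o, p in zip(original, post_alignment)]


def decode_step(original, post_alignment, weight):
    """Next-token distribution obtained from the blended logits."""
    return softmax(blend_logits(original, post_alignment, weight))


# Toy two-token vocabulary: index 0 = "comply", index 1 = "refuse".
original = [2.0, 0.5]        # unmodified model prefers "comply"
post_alignment = [0.0, 3.0]  # self-evaluation feedback prefers "refuse"

low_conflict = decode_step(original, post_alignment, 0.0)
high_conflict = decode_step(original, post_alignment, 1.0)
```

Under these assumptions, a low weight leaves the model's original (helpful) distribution untouched, while a high weight shifts probability mass toward the post-alignment distribution, which is the trade-off between helpfulness and harmlessness that the abstract describes.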