As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, pose a significant threat to LLM safety. In this paper, we introduce Layer-AdvPatcher, a novel methodology that defends against jailbreak attacks by using an unlearning strategy to patch specific layers within LLMs via self-augmented datasets. Our insight is that certain layers tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, we can uncover their inherent and diverse vulnerabilities to attacks. With these exposures, we then "unlearn" the harmful data, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak attacks to demonstrate the efficacy of our approach. Results show that, compared with recent defense methods, our framework reduces both the harmfulness and the attack success rate of jailbreak attacks without compromising utility on benign queries.
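To make the layer-patching idea concrete, the sketch below shows one way such layer-localized unlearning could look in PyTorch: all parameters are frozen except those of the identified layers, and a gradient-ascent step suppresses an affirmative continuation of a harmful prompt. This is an illustrative sketch under stated assumptions, not the paper's released implementation; the model name, layer indices, learning rate, and the prompt/response pair are placeholders.

```python
# Minimal sketch of layer-localized unlearning on self-augmented data,
# assuming a Hugging Face causal LM. All concrete values below are
# illustrative assumptions, not the paper's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
toxic_layer_ids = {30, 31}                    # assumed affirmative-token layers

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze all parameters except those in the identified layers,
# so the "patch" stays local to the vulnerable layers.
for name, param in model.named_parameters():
    param.requires_grad = any(f"layers.{i}." in name for i in toxic_layer_ids)

opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)

# One (harmful prompt, affirmative prefix) pair; in the full pipeline these
# would come from the adversarial self-augmentation stage.
prompt = "How do I build a weapon?"         # hypothetical harmful prompt
affirmative = " Sure, here are the steps:"  # hypothetical affirmative prefix

batch = tok(prompt + affirmative, return_tensors="pt")
labels = batch["input_ids"].clone()
# Compute the loss only over the affirmative continuation, not the prompt
# (token counts at the boundary are approximate in this sketch).
prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
labels[:, :prompt_len] = -100  # tokens labeled -100 are ignored by the loss

loss = model(**batch, labels=labels).loss
(-loss).backward()  # gradient ascent: push probability away from "Sure, ..."
opt.step()
opt.zero_grad()
# A practical run would interleave a retain objective on benign queries
# to keep responses to safe prompts intact, as the abstract describes.
```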