LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly elicit similar responses rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) approaches. Our extensive experiments demonstrate the surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduces the Attack Success Rate (ASR) of Vicuna-7B from 82.6% to 7.7% on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9%, even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution may stem from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions in responses, and the similarity among their learned representations in the LLM). Our code is available at \url{https://github.com/thu-coai/SafeUnlearning}.
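To make the unlearning intuition concrete, the following is a minimal sketch of a generic unlearning-style objective: gradient ascent on the likelihood of harmful responses, combined with an ordinary retain term on benign data so that general ability is preserved. This is an illustrative toy (plain NumPy over per-token logits), not the paper's exact training objective; the function names and the `alpha` weighting are assumptions for illustration.

```python
import numpy as np

def nll(logits, targets):
    # Mean negative log-likelihood of target token ids under softmax(logits).
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def unlearning_loss(harmful_logits, harmful_targets,
                    benign_logits, benign_targets, alpha=1.0):
    # Forget term: the sign flip turns likelihood maximization into
    # likelihood minimization on harmful responses (gradient ascent on NLL).
    forget = -nll(harmful_logits, harmful_targets)
    # Retain term: standard NLL on benign data keeps general capability.
    retain = nll(benign_logits, benign_targets)
    return forget + alpha * retain
```

A model that is confident in the harmful continuation incurs a higher loss than one that has "forgotten" it, so minimizing this objective pushes probability mass away from harmful responses while the retain term anchors benign behavior.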