LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearn the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions \emph{without} any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on \emph{out-of-distribution} (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6\% to 7.7\%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9\% even under the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at \url{https://github.com/thu-coai/SafeUnlearning}.
翻译:众所周知,即使经过安全对齐,大语言模型(LLMs)仍然容易受到越狱攻击。一个重要观察是:尽管不同类型的越狱攻击可能生成差异显著的查询,但它们大多会产生相似的、根植于相同有害知识(例如制造炸弹的详细步骤)的回复。因此,我们推测,与主流基于监督微调(SFT)的方法相比,直接让大语言模型“遗忘”有害知识可能是防御越狱攻击的更有效途径。我们的大量实验证实了这一洞见,并表明我们基于遗忘的方法具有惊人的泛化能力:在训练中仅使用20个原始有害问题(不包含任何越狱提示),我们的方案将 Vicuna-7B 模型在包裹了各种复杂越狱提示的分布外(OOD)有害问题上的攻击成功率(ASR)从 82.6% 降至 7.7%。这显著优于 Llama2-7B-Chat,后者在约10万个安全对齐样本上进行了微调,并且即使在附加安全系统提示的帮助下,其ASR仍高达 21.9%。进一步分析表明,我们解决方案的泛化能力源于不同有害问题之间有害回复的内在关联性(例如,回复模式、共享的步骤与行动,以及它们在大语言模型中所学表征的相似性)。我们的代码发布于 \url{https://github.com/thu-coai/SafeUnlearning}。