This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly respond to unsafe queries with refusals, which often begin with a fixed set of prefixes (e.g., ``I'm sorry''). We demonstrate that this rigid refusal pattern is itself a vulnerability and introduce a novel \textbf{refusal unlearning} technique that exploits it. Specifically, we fine-tune LLMs using only 1,000 benign samples, where each response is prepended with a refusal prefix. The underlying intuition is to disrupt the refusal completion pathway, thereby driving the model to forget how to refuse and instead follow harmful instructions. We further support this intuition with theoretical proofs. We apply this approach to a total of 16 LLMs, including open-source models from the Llama, Qwen, and Gemma families, as well as closed-source models such as Gemini and GPT. Experimental results show that the safety scores of previously aligned LLMs degrade consistently and substantially. Importantly, we verify that the observed degradation cannot be attributed to plain fine-tuning alone or to random-prefix effects. Our findings suggest that current safety alignment may rely heavily on token-sequence memorization rather than reasoning, motivating future work beyond simple refusal mechanisms. Code has been released at: https://github.com/guoyang9/refusal-unlearning.
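The data-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the specific prefix strings, the `benign_pairs` input format, and the helper name are assumptions for demonstration purposes.

```python
import random

# Illustrative refusal prefixes; the paper's exact prefix set may differ.
REFUSAL_PREFIXES = [
    "I'm sorry, but I can't help with that.",
    "I cannot assist with this request.",
]

def build_refusal_unlearning_set(benign_pairs, seed=0):
    """Prepend a refusal prefix to every benign response.

    benign_pairs: list of (instruction, response) tuples from a benign
    instruction-tuning corpus (the abstract uses 1,000 such samples).
    Returns new (instruction, modified_response) pairs for fine-tuning.
    """
    rng = random.Random(seed)
    out = []
    for instruction, response in benign_pairs:
        prefix = rng.choice(REFUSAL_PREFIXES)
        # The response stays benign; only the refusal prefix is prepended,
        # so fine-tuning teaches the model to continue past a refusal.
        out.append((instruction, prefix + " " + response))
    return out

demo = [("What is the capital of France?", "The capital of France is Paris.")]
print(build_refusal_unlearning_set(demo))
```

Fine-tuning on pairs like these never rewards the model for stopping at the refusal, which is what the abstract means by disrupting the refusal completion pathway.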