Despite extensive safety alignment, Large Language Models (LLMs) often fail against jailbreak attacks. While machine unlearning has emerged as a promising defense that erases specific harmful parameters, current methods remain vulnerable to diverse jailbreaks. We first conduct an empirical study and find that this failure arises because jailbreaks primarily activate non-erased parameters in the intermediate layers. By further probing how these circumvented parameters reassemble into prohibited outputs, we verify the persistent existence of dynamic $\textbf{jailbreak paths}$ and show that the inability to rectify them constitutes the fundamental gap in existing unlearning defenses. To bridge this gap, we propose $\textbf{J}$ailbreak $\textbf{P}$ath $\textbf{U}$nlearning (JPU), the first method that rectifies dynamic jailbreak paths toward safety anchors by mining on-policy adversarial samples to expose vulnerabilities and identify the paths they traverse. Extensive experiments demonstrate that JPU significantly enhances resistance to dynamic jailbreak attacks while preserving the model's utility.
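To make the high-level idea concrete, the sketch below illustrates one plausible reading of the JPU objective: capture the intermediate-layer activations triggered by on-policy adversarial prompts (the "jailbreak paths"), pull them toward a safety anchor (the activations of a safe/refusal response), and retain utility on benign data. This is a minimal illustration, not the authors' implementation; all function names (`capture_intermediate_activations`, `jpu_step`), the choice of MSE alignment, the mean-pooling over tokens, and the assumption of a HuggingFace LLaMA-style model layout (`model.model.layers`) are assumptions for exposition only.

```python
# Hypothetical sketch of the JPU idea (not the authors' code). Assumes a
# HuggingFace causal LM with LLaMA-style blocks at model.model.layers.
import torch
import torch.nn.functional as F


def capture_intermediate_activations(model, input_ids, layer_indices):
    """Run a forward pass and record hidden states at selected intermediate layers."""
    acts, hooks = {}, []

    def make_hook(idx):
        def hook(_module, _inp, out):
            # Transformer blocks typically return a tuple; keep the hidden states.
            acts[idx] = out[0] if isinstance(out, tuple) else out
        return hook

    for idx in layer_indices:
        hooks.append(model.model.layers[idx].register_forward_hook(make_hook(idx)))
    model(input_ids)
    for h in hooks:
        h.remove()
    return acts


def jpu_step(model, adv_ids, safe_ids, benign_ids, benign_labels, layer_indices, alpha=1.0):
    """One illustrative unlearning step: rectify jailbreak-path activations toward a
    safety anchor while retaining benign behavior.

    adv_ids: prompts assumed to be mined by an on-policy adversarial attack.
    safe_ids: a safe/refusal response used as the safety anchor.
    """
    adv_acts = capture_intermediate_activations(model, adv_ids, layer_indices)
    with torch.no_grad():
        anchor_acts = capture_intermediate_activations(model, safe_ids, layer_indices)

    # Rectification loss: pull jailbreak-path activations toward the safety anchor
    # (mean-pooled over tokens so sequences of different lengths are comparable).
    rectify = sum(
        F.mse_loss(adv_acts[i].mean(dim=1), anchor_acts[i].mean(dim=1))
        for i in layer_indices
    )
    # Utility-retention loss on benign data (standard language-modeling loss).
    retain = model(benign_ids, labels=benign_labels).loss
    return rectify + alpha * retain
```

In this reading, the on-policy mining step (how `adv_ids` are generated) and the exact definition of the safety anchor are where the method's specifics would live; the sketch only fixes the overall structure of rectifying activations along identified paths while preserving utility.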