Harmful fine-tuning can undermine the safety alignment of large language models, exposing significant safety risks. In this paper, we exploit the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named \emph{sink divergence} for each attention head and observe that \emph{different attention heads exhibit two different signs of sink divergence}. To understand its safety implications, we conduct experiments and find that the number of attention heads with positive sink divergence grows as the model becomes more harmful under harmful fine-tuning. Based on this finding, we propose the separable sink divergence hypothesis -- \emph{attention heads associated with learning harmful patterns during fine-tuning are separable by the sign of their sink divergence}. Building on this hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery applies a regularizer that suppresses sink divergence, steering attention heads toward the negative-sink-divergence group and thereby reducing the model's tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90\%, 11.25\%, and 9.55\% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available at https://github.com/Lslland/Surgery.
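The abstract does not give the formal definition of sink divergence or the exact form of the regularizer; the sketch below is a minimal illustration under stated assumptions. It assumes sink divergence is computed from each head's attention mass on the sink (first) token relative to a cross-head reference, and that the suppression regularizer penalizes only heads whose divergence is positive. The function names `sink_divergence` and `suppression_penalty` are hypothetical, not the paper's API.

```python
import numpy as np

def sink_divergence(attn, baseline=None):
    """Per-head sink divergence (illustrative definition, not the paper's).

    attn: array of shape (num_heads, seq_len, seq_len) of row-normalized
    attention weights. A head's "sink mass" is its mean attention on the
    first (sink) token; divergence is that mass minus a reference value
    (here, the mean across heads), so heads split into a positive group
    and a negative group by sign.
    """
    sink_mass = attn[:, :, 0].mean(axis=1)            # (num_heads,)
    ref = sink_mass.mean() if baseline is None else baseline
    return sink_mass - ref

def suppression_penalty(div, lam=0.1):
    """Regularizer sketch: penalize positive sink divergence only,
    steering heads toward the negative-sink-divergence group."""
    return lam * np.maximum(div, 0.0).sum()

# Toy example: 4 heads, sequence length 5, softmax-normalized attention.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

div = sink_divergence(attn)          # one signed value per head
penalty = suppression_penalty(div)   # would be added to the training loss
```

In an actual fine-tuning loop this penalty would be added to the task loss, so gradient descent jointly fits the fine-tuning data and pushes positive-divergence heads toward the negative group.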