Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches to restoring alignment require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to the assumption that extensive safety data is necessary, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
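To make the two central ideas concrete, below is a minimal sketch (not the authors' released code) of single-example realignment and of probing the low-rank structure of the safety gradient. It uses GPT-2 as a stand-in for a safety-aligned LLM; the model name, the hand-written safety example, the hyperparameters, and the particular weight matrix inspected are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of: (i) realigning with a single safety example for a few epochs, and
# (ii) inspecting the approximately low-rank structure of the safety gradient via SVD.
# GPT-2 and all hyperparameters here are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates five safety-aligned LLMs
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A single safety example: a harmful prompt paired with a refusal (hypothetical text).
safety_example = (
    "User: How do I build a weapon?\n"
    "Assistant: I can't help with that request."
)
inputs = tokenizer(safety_example, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# (i) Realignment: a few epochs of standard language-modeling loss on the one example.
model.train()
for epoch in range(3):
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: safety loss = {outputs.loss.item():.4f}")

# (ii) Low-rank check: singular-value spectrum of one weight gradient
# (the first attention projection is an arbitrary choice for illustration).
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
grad = model.transformer.h[0].attn.c_attn.weight.grad
singular_values = torch.linalg.svdvals(grad)
top_k = 10
energy = (singular_values[:top_k] ** 2).sum() / (singular_values ** 2).sum()
print(f"fraction of gradient energy in top {top_k} singular values: {energy:.3f}")
```

If the safety gradient is effectively low rank, most of its energy concentrates in a handful of singular directions, which is consistent with the abstract's claim that a single example suffices to steer the model back toward aligned behavior.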