Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even a small amount of harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction, defined by the weight difference between an aligned (safe) model and its unaligned counterpart, rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing a "narrow safety basin" in parameter space. To address this, we propose AsFT (Anchoring Safety in Fine-Tuning), which maintains safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively keeps the model within the narrow safety basin, thereby preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behavior by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
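The core mechanism described above, projecting the fine-tuning update onto the alignment direction and penalizing the orthogonal remainder, can be illustrated with a minimal PyTorch sketch. This is an illustrative assumption of one possible formulation, not the paper's exact implementation; the names `asft_regularizer`, `training_step`, `alignment_dirs`, `w_init`, and the coefficient `lambda_reg` are hypothetical.

```python
import torch


def asft_regularizer(delta_w: torch.Tensor, alignment_dir: torch.Tensor) -> torch.Tensor:
    """Penalize the component of a weight update orthogonal to the alignment direction.

    delta_w:       current fine-tuning update for one weight matrix, flattened
    alignment_dir: (W_aligned - W_unaligned) for the same matrix, flattened
    """
    d = alignment_dir / (alignment_dir.norm() + 1e-12)  # unit alignment direction
    parallel = (delta_w @ d) * d                         # projection onto the alignment direction
    orthogonal = delta_w - parallel                      # component leaving the "safety basin"
    return orthogonal.pow(2).sum()                       # squared norm of the orthogonal part


def training_step(model, batch, task_loss_fn, alignment_dirs, w_init, lambda_reg=0.1):
    """Hypothetical training step: task loss plus a weighted orthogonal-update penalty."""
    loss = task_loss_fn(model, batch)
    reg = torch.zeros((), device=loss.device)
    for name, p in model.named_parameters():
        if name in alignment_dirs:
            delta_w = (p - w_init[name]).flatten()       # update relative to the initial weights
            reg = reg + asft_regularizer(delta_w, alignment_dirs[name].flatten())
    return loss + lambda_reg * reg
```

Under this reading, the regularizer leaves the component of the update along the alignment direction unpenalized, so fine-tuning can still move freely along the basin while motion orthogonal to it is suppressed.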