Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with state-of-the-art methods, while preserving task-specific utility and requiring minimal implementation overhead. Furthermore, we show that safety token regularization enhances training stability and overall performance beyond safety considerations alone. This work offers a practical and readily deployable strategy for continual safety alignment in fine-tuned LLMs.
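The abstract describes constraining the logits of salient safety tokens during fine-tuning but does not give the form of the constraint. As a minimal sketch, assuming the constraint is a squared penalty on how far the fine-tuned model's safety-token logits drift from those of the frozen aligned model, the combined objective for one token position might look like the following. The names `safety_token_penalty`, `str_loss`, `safety_token_ids`, and `reg_weight` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def safety_token_penalty(logits, ref_logits, safety_token_ids):
    """Mean squared drift of the safety-token logits from the reference model.

    ASSUMPTION: the abstract only says the logits are "constrained";
    a squared-difference penalty is one plausible choice, not the
    paper's confirmed formulation.
    """
    diffs = [(logits[i] - ref_logits[i]) ** 2 for i in safety_token_ids]
    return sum(diffs) / len(diffs)


def str_loss(logits, ref_logits, target_id, safety_token_ids, reg_weight=0.1):
    """Task loss plus the weighted safety-token penalty (reg_weight is hypothetical)."""
    task = cross_entropy(logits, target_id)
    penalty = safety_token_penalty(logits, ref_logits, safety_token_ids)
    return task + reg_weight * penalty


# Toy vocabulary of 4 tokens; ids 1 and 2 stand in for tokens drawn
# from rejection templates (for example, "sorry" or "cannot").
ref = [0.0, 2.0, 1.0, -1.0]    # logits from the frozen aligned model
tuned = [0.5, 0.0, 1.0, -1.0]  # logits from the model being fine-tuned
loss = str_loss(tuned, ref, target_id=0, safety_token_ids=[1, 2])
```

Because the penalty only touches a small set of vocabulary entries, it adds one extra forward pass of the reference model and a handful of arithmetic operations per position, which is consistent with the abstract's claim of minimal additional computation. In this sketch the penalty is independent of which parameters are trainable, so it would apply equally to full fine-tuning or to LoRA adapters.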