Large language models must continually adapt to new tasks while preserving safety alignment. However, fine-tuning even on benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows that samples contribute unequally: high-gradient samples cause greater safety degradation and drive models back toward their pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. Building on this finding, we propose a gradient-based sample selection method that filters out high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. It remains robust across selection ratios, task orderings, and diverse attack benchmarks.
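To make the selection step concrete, the following is a minimal sketch of filtering high-gradient samples before a fine-tuning update. It is an illustration under stated assumptions, not the method as implemented in this work: the abstract does not specify how gradient magnitude is measured or thresholded, so we assume the L2 norm of each sample's loss gradient and a fixed fraction `drop_ratio` of the highest-norm samples to discard. A toy logistic-regression model stands in for the language model, and the names `per_sample_grad_norm`, `filter_high_gradient`, and `drop_ratio` are hypothetical.

```python
import math
from typing import List, Sequence


def per_sample_grad_norm(w: Sequence[float], x: Sequence[float], y: int) -> float:
    """L2 norm of the logistic-loss gradient for one sample (x, y), with y in {0, 1}.

    Assumption: a toy logistic-regression model stands in for the LLM.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable sigmoid.
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        p = e / (1.0 + e)
    # For binary cross-entropy with a sigmoid output, d(loss)/dw = (p - y) * x.
    return abs(p - y) * math.sqrt(sum(xi * xi for xi in x))


def filter_high_gradient(norms: Sequence[float], drop_ratio: float) -> List[int]:
    """Return indices (in original order) of the samples kept after dropping
    the top `drop_ratio` fraction ranked by gradient norm.

    Assumption: a fixed-fraction cutoff; the source does not state the rule.
    """
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    n_drop = int(len(norms) * drop_ratio)
    # Rank samples from smallest to largest gradient norm.
    order = sorted(range(len(norms)), key=lambda i: norms[i])
    kept = order[: len(norms) - n_drop]
    return sorted(kept)


# Usage: score a small batch, then keep only the lower-gradient samples.
weights = [0.0, 0.0]
batch = [([3.0, 4.0], 1), ([0.1, 0.0], 0), ([1.0, 0.0], 1), ([0.0, 0.2], 0)]
batch_norms = [per_sample_grad_norm(weights, x, y) for x, y in batch]
kept_indices = filter_high_gradient(batch_norms, drop_ratio=0.25)
```

In this sketch, only the samples at `kept_indices` would contribute to the parameter update. With a real language model, the per-sample gradient norms would have to be computed over the model's parameters (or a chosen subset), which is considerably more expensive than in this toy setting.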