Harmful fine-tuning poses significant safety challenges for fine-tuning-as-a-service in large language models. Existing alignment-stage defenses, e.g., Vaccine, RepNoise, Booster, and T-Vaccine, mitigate harmful fine-tuning by enhancing the model's robustness during the alignment phase. However, these methods often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector that ranks alignment data, up-ranking high-quality and safety-critical samples and down-ranking low-quality and non-safety-critical ones. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset improves defense performance by 2.60% and 3.30%, respectively, and improves inference performance by 3.50% and 1.10%. Notably, it also reduces training time by 56.83% and 57.63%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.
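To make the ranking-then-selection idea concrete, the following is a minimal sketch of only the final step: keeping the top-ranked fraction of an alignment dataset once each sample has a selector score. It is not the authors' implementation. The function name `select_core_subset`, the `keep_ratio` parameter, and the example scores are hypothetical, and the sketch assumes the scores have already been produced by some trained selector, whose training is not shown here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_core_subset(
    samples: Sequence[T],
    scores: Sequence[float],
    keep_ratio: float,
) -> List[T]:
    """Return the highest-scoring fraction of `samples`.

    `scores[i]` is assumed to be a selector score for `samples[i]`,
    where a higher score means higher quality and more safety-critical.
    Both the scoring convention and `keep_ratio` are illustrative
    assumptions, not details taken from the paper.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one sample when the dataset is non-empty.
    k = max(1, int(len(samples) * keep_ratio))

    # Rank indices by score, highest first (up-ranked samples come first).
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)

    return [samples[i] for i in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical alignment samples and selector scores.
    data = ["a", "b", "c", "d"]
    selector_scores = [0.1, 0.9, 0.4, 0.7]
    print(select_core_subset(data, selector_scores, keep_ratio=0.5))
```

The selected subset could then be passed to an alignment-stage method in place of the full dataset, which is where the abstract reports the reduction in training time.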