Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Harmful fine-tuning poses significant safety challenges for fine-tuning-as-a-service with large language models. Existing alignment-stage defenses, e.g., Vaccine, RepNoise, Booster, and T-Vaccine, mitigate harmful fine-tuning by enhancing the model's robustness during the alignment phase. However, these methods often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector that ranks alignment data, up-ranking high-quality and safety-critical samples and down-ranking low-quality and non-safety-critical ones. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset improves defense performance by 2.60% and 3.30%, respectively, and improves inference performance by 3.50% and 1.10%. Notably, it also reduces training time by 56.83% and 57.63%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.
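As a rough illustration of the ranking-then-selection step described above, the following minimal sketch keeps the top-k alignment samples given per-sample scores. It is not the authors' implementation: the abstract does not say how the selector is trained or how scores are produced, so the scores here are simply taken as given, and the names `select_core_subset` and `speedup_from_reduction` are hypothetical. The second helper only converts a reported percentage reduction in training time into the equivalent speedup factor.

```python
from typing import List, Sequence, Tuple


def select_core_subset(
    samples: Sequence[str], scores: Sequence[float], k: int
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring samples, highest score first.

    Assumption: `scores` come from some trained selector; how that
    selector is trained is outside the scope of this sketch.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if k < 0:
        raise ValueError("k must be non-negative")
    # sorted() is stable, so samples with equal scores keep their input order.
    ranked = sorted(zip(samples, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert a percentage reduction in training time into a speedup factor."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Toy data with made-up scores, purely for illustration.
    data = ["sample A", "sample B", "sample C"]
    toy_scores = [0.1, 0.9, 0.5]
    print(select_core_subset(data, toy_scores, k=2))
    # Training-time reductions reported in the abstract.
    for reduction in (56.83, 57.63):
        print(f"{reduction}% less time is about {speedup_from_reduction(reduction):.2f}x faster")
```

Under this reading, the reported training-time reductions of 56.83% and 57.63% correspond to roughly 2.3x faster alignment training in each case.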

