The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, in which hidden malicious behaviors are triggered by specific inputs, compromising the integrity and reliability of natural language processing (NLP) systems. This paper proposes that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities, even when those models are not themselves entirely secure. In our experiments, we verify this hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared with multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks, without requiring additional resources or attack-specific knowledge. Our approach consistently outperforms recent advanced baselines, reducing the attack success rate by about 75% on average. Since model merging is already an established approach for improving model performance, the additional defensive advantage it provides can be seen as a cost-free bonus.
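The merging operation at the heart of this idea can be illustrated with a minimal sketch of element-wise parameter averaging across homogeneous models. This is a hedged illustration, not the paper's exact procedure: real checkpoints would hold tensors keyed by layer names, and the function and toy weight values below are hypothetical, chosen only to show how averaging dilutes a single model's anomalous (e.g. backdoor-carrying) parameters.

```python
# Minimal sketch of model merging via parameter averaging, assuming
# homogeneous models whose weights share identical keys and shapes.
# Plain floats stand in for tensors purely for illustration.

def merge_models(state_dicts):
    """Average each parameter element-wise across the given models."""
    return {
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }

# Toy example (hypothetical values): the backdoored model's outlier
# weight "w" is pulled back toward the clean models' values.
backdoored = {"w": 4.0, "b": 0.5}
clean_a = {"w": 1.0, "b": 0.4}
clean_b = {"w": 1.0, "b": 0.6}
merged = merge_models([backdoored, clean_a, clean_b])
```

In practice the same averaging would run over full model state dicts (e.g. PyTorch `state_dict()` tensors), and the paper's point is that this routine merging step, already used to boost task performance, doubles as a backdoor mitigation.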