The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, in which hidden malicious behaviors are triggered by specific inputs, compromising the integrity and reliability of natural language processing (NLP) systems. This paper proposes that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities, even when those models are not themselves entirely secure. In our experiments, we verify this hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared with multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks, without requiring additional resources or attack-specific knowledge. Our approach consistently outperforms recent advanced baselines, reducing the attack success rate by about 75% on average. Since model merging is already an established approach for improving model performance, the additional defensive advantage it provides can be seen as a cost-free bonus.
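The merging operation at the heart of this idea can be illustrated with a minimal sketch of element-wise parameter averaging across homogeneous models. This is a hedged illustration, not the paper's exact procedure: real checkpoints would hold tensors keyed by layer names, and the function and toy weight values below are hypothetical, chosen only to show how averaging dilutes a single model's anomalous (e.g. backdoor-carrying) parameters.

```python
# Minimal sketch of model merging via parameter averaging, assuming
# homogeneous models whose weights share identical keys and shapes.
# Plain floats stand in for tensors purely for illustration.

def merge_models(state_dicts):
    """Average each parameter element-wise across the given models."""
    return {
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }

# Toy example (hypothetical values): the backdoored model's outlier
# weight "w" is pulled back toward the clean models' values.
backdoored = {"w": 4.0, "b": 0.5}
clean_a = {"w": 1.0, "b": 0.4}
clean_b = {"w": 1.0, "b": 0.6}
merged = merge_models([backdoored, clean_a, clean_b])
```

In practice the same averaging would run over full model state dicts (e.g. PyTorch `state_dict()` tensors), and the paper's point is that this routine merging step, already used to boost task performance, doubles as a backdoor mitigation.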