Recently, various parameter-efficient fine-tuning (PEFT) strategies for language models have been proposed and successfully applied. However, this raises the question of whether PEFT, which updates only a limited set of model parameters, introduces security vulnerabilities when confronted with weight-poisoning backdoor attacks. In this study, we show that PEFT is more susceptible to weight-poisoning backdoor attacks than full-parameter fine-tuning: pre-defined triggers remain exploitable and pre-defined targets retain high confidence even after fine-tuning. Motivated by this insight, we develop a Poisoned Sample Identification Module (PSIM) that leverages PEFT and identifies poisoned samples through prediction confidence, providing a robust defense against weight-poisoning backdoor attacks. Specifically, we use PEFT to train the PSIM on samples whose labels have been randomly reset. During inference, extreme confidence serves as an indicator of a poisoned sample, while all other samples are treated as clean. We conduct experiments on text classification tasks with five fine-tuning strategies and three weight-poisoning backdoor attack methods. The experiments show that weight-poisoning backdoor attacks achieve success rates near 100% when PEFT is used. Furthermore, our defense exhibits overall competitive performance in mitigating weight-poisoning backdoor attacks.
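The confidence-based identification step can be illustrated with a minimal sketch. It assumes that a PSIM trained on randomly reset labels stays close to chance-level confidence on clean inputs, so only trigger-bearing inputs produce extreme confidence. The function names, the example logits, and the threshold of 0.9 are hypothetical choices for illustration and are not taken from the study.

```python
import math
import random


def reset_labels_randomly(num_samples, num_classes, seed=0):
    """Draw a uniformly random label for every training sample (assumed reset scheme)."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in range(num_samples)]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_poisoned(psim_logits, threshold=0.9):
    """Flag a sample as poisoned when the PSIM's top confidence exceeds the threshold.

    The threshold value is an assumption; it would need tuning per task.
    """
    return [max(softmax(row)) > threshold for row in psim_logits]


# Hypothetical PSIM outputs for three inputs in a binary classification task.
example_logits = [
    [0.1, -0.2],  # near-chance confidence: treated as clean
    [6.0, -6.0],  # extreme confidence: treated as poisoned
    [-0.3, 0.4],  # near-chance confidence: treated as clean
]
flags = flag_poisoned(example_logits)
```

In such a setup, inputs flagged by the PSIM would be rejected or handled separately, and the remaining inputs would be passed to the fine-tuned victim model for the actual prediction.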