Recently, various parameter-efficient fine-tuning (PEFT) strategies for language models have been proposed and successfully applied. However, this raises the question of whether PEFT, which updates only a limited set of model parameters, introduces security vulnerabilities when confronted with weight-poisoning backdoor attacks. In this study, we show that PEFT is more susceptible to weight-poisoning backdoor attacks than full-parameter fine-tuning: pre-defined triggers remain exploitable and pre-defined targets retain high confidence even after fine-tuning. Motivated by this insight, we develop a Poisoned Sample Identification Module (PSIM) that leverages PEFT and identifies poisoned samples by their confidence, providing a robust defense against weight-poisoning backdoor attacks. Specifically, we use PEFT to train the PSIM on samples whose labels have been randomly reset. During inference, extreme confidence serves as an indicator of a poisoned sample, while all other samples are treated as clean. We conduct experiments on text classification tasks with five fine-tuning strategies and three weight-poisoning backdoor attack methods. The experiments show attack success rates near 100% for weight-poisoning backdoor attacks when PEFT is used. Furthermore, our defensive approach exhibits overall competitive performance in mitigating weight-poisoning backdoor attacks.
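The confidence-based filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example logits, and the threshold value of 0.7 are all hypothetical, and the PEFT training step itself is omitted. The sketch assumes the PSIM outputs per-class logits, that a model trained on randomly reset labels gives near-uniform predictions on clean inputs, and that a surviving backdoor trigger still produces a very confident prediction.

```python
import math
import random


def reset_labels_randomly(labels, num_classes, seed=0):
    """Replace every training label with a uniformly random class id.

    Illustrates the 'randomly reset sample labels' step; the seed and
    the uniform sampling scheme are assumptions of this sketch.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in labels]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_poisoned(psim_logits, threshold=0.7):
    """Flag samples whose top PSIM confidence exceeds the threshold.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    return [max(softmax(row)) > threshold for row in psim_logits]


# Hypothetical PSIM logits for two inputs in a binary task.
psim_logits = [
    [0.1, -0.1],  # near-uniform confidence: treated as clean
    [6.0, -6.0],  # extreme confidence: flagged as poisoned
]
flags = flag_poisoned(psim_logits)
random_labels = reset_labels_randomly([0, 1, 1, 0], num_classes=2)
```

In a deployment along these lines, flagged samples would be rejected or handled separately before reaching the victim model, and the threshold would need to be tuned on held-out data.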