Various parameter-efficient fine-tuning (PEFT) strategies for language models have recently been proposed and successfully applied. However, this raises the question of whether PEFT, which updates only a limited set of model parameters, introduces security vulnerabilities when confronted with weight-poisoning backdoor attacks. In this study, we show that PEFT is more susceptible to weight-poisoning backdoor attacks than full-parameter fine-tuning: pre-defined triggers remain exploitable and pre-defined targets retain high confidence even after fine-tuning. Motivated by this insight, we develop a Poisoned Sample Identification Module (PSIM) that leverages PEFT and identifies poisoned samples through confidence, providing a robust defense against weight-poisoning backdoor attacks. Specifically, we use PEFT to train the PSIM on samples whose labels have been randomly reset. During inference, a sample that receives extreme confidence is flagged as poisoned, and all other samples are treated as clean. We conduct experiments on text classification tasks with five fine-tuning strategies and three weight-poisoning backdoor attack methods. The experiments show near-100% success rates for weight-poisoning backdoor attacks when PEFT is used. Furthermore, our defense exhibits competitive overall performance in mitigating weight-poisoning backdoor attacks.
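The two steps the abstract describes for PSIM, randomly resetting the training labels and flagging samples by extreme confidence at inference, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `reset_labels` and `flag_poisoned`, the threshold of 0.9, and the example logits are all assumptions made for illustration, and the PEFT training of the PSIM itself is omitted.

```python
import math
import random


def reset_labels(labels, num_classes, seed=0):
    """Replace every label with a uniformly random class.

    Hypothetical helper: the abstract only says labels are
    "randomly reset"; the exact sampling scheme is an assumption.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in labels]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_poisoned(psim_logits, threshold=0.9):
    """Flag a sample as poisoned when the PSIM's top-class
    probability exceeds `threshold` (an assumed value)."""
    return [max(softmax(logits)) > threshold for logits in psim_logits]


# Made-up PSIM outputs for three samples of a binary task.
# A PSIM trained on random labels should be close to chance on
# clean inputs, while a surviving trigger still yields a
# near-certain prediction for the attacker's target class.
batch_logits = [[0.1, -0.2], [6.0, -6.0], [0.3, 0.4]]
flags = flag_poisoned(batch_logits)
print(flags)  # only the second sample is flagged
```

In practice the threshold would need to be tuned on held-out data, since the abstract does not state how "extreme" confidence is defined.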