Backdoor Attacks for LLMs with Weak-To-Strong Knowledge Distillation

Although Large Language Models (LLMs) are widely applied because of their exceptional capabilities, they have been shown to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and applying full-parameter fine-tuning. However, such attacks are limited because they require significant computational resources, especially as the size of LLMs increases. Parameter-efficient fine-tuning (PEFT) offers an alternative, but its restricted parameter updates may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may struggle to achieve feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose W2SAttack, a novel weak-to-strong backdoor attack algorithm based on feature alignment-enhanced knowledge distillation. Specifically, we poison a small-scale language model through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model, which is trained with PEFT, through feature alignment-enhanced knowledge distillation. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different teacher model architectures. Experimental results indicate attack success rates close to 100% for backdoor attacks targeting PEFT.
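The abstract describes the student's training signal only in words. As a rough illustration, the sketch below combines three terms that such a distillation objective could plausibly contain: a cross-entropy term on the (possibly poisoned) label, a temperature-softened KL term that pulls the student's predictions toward the poisoned teacher's, and a feature alignment term that pulls their hidden features together. Everything here is an assumption for illustration and is not taken from the paper: the function names, the weights `alpha` and `beta`, the temperature, and the choice of KL divergence and mean squared distance. In practice the teacher and student have different hidden sizes, so a projection layer would be needed before comparing features; the sketch assumes the features already share a dimension.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Negative log-likelihood of the label under the student's prediction."""
    return -math.log(softmax(student_logits)[label])


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual rescaling used in
    knowledge distillation; its use here is an assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def feature_alignment_loss(student_feat, teacher_feat):
    """Mean squared distance between feature vectors of equal length.

    Assumes both vectors were already projected to a shared dimension.
    """
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def combined_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                  label, alpha=1.0, beta=1.0, temperature=2.0):
    """Hypothetical weighted sum of the three terms (weights are assumptions)."""
    return (cross_entropy(student_logits, label)
            + alpha * kd_loss(student_logits, teacher_logits, temperature)
            + beta * feature_alignment_loss(student_feat, teacher_feat))


# Toy example: a poisoned sample whose target label is class 1.
teacher_logits = [-2.0, 3.0]   # poisoned teacher is confident in the target
student_logits = [1.0, 0.5]    # student has not yet learned the trigger
loss = combined_loss(student_logits, teacher_logits,
                     student_feat=[0.2, 0.1], teacher_feat=[0.9, -0.4],
                     label=1)
print(f"combined loss: {loss:.4f}")
```

In a real setup only the student's PEFT parameters (for example, low-rank adapter weights) would receive gradients from this loss, while the teacher stays frozen; that training loop is omitted here.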
