Backdoors implanted in pre-trained language models (PLMs) can be transferred to various downstream tasks, which poses a severe security threat. However, most existing backdoor attacks against PLMs are untargeted and task-specific. The few targeted, task-agnostic methods rely on manually pre-defined triggers and output representations, which limits their effectiveness and generality. In this paper, we first summarize the requirements that a more threatening backdoor attack against PLMs should satisfy, and then propose a new backdoor attack method, UOR, which overcomes this bottleneck by replacing manual selection with automatic optimization. Specifically, we define poisoned supervised contrastive learning, which automatically learns more uniform and universal output representations of triggers for various PLMs. Moreover, we use gradient search to select appropriate trigger words that adapt to different PLMs and vocabularies. Experiments show that our method achieves better attack performance than manual methods on various text classification tasks. We further test our method on PLMs with different architectures and usage paradigms, and on more difficult tasks, which demonstrates its universality.
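The abstract does not give the exact form of the poisoned supervised contrastive objective, so the following is only a minimal sketch of the generic supervised contrastive loss that such an objective would plausibly build on. It is not UOR's actual formulation. The function name, the temperature value, and the labeling scheme (label 0 for clean inputs, labels 1..k for inputs carrying each of k triggers) are assumptions made for illustration.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss over one batch.

    embeddings: (n, d) array of output representations (assumed non-zero rows).
    labels: (n,) integer array; hypothetical scheme: 0 = clean, 1..k = trigger id.
    """
    z = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Shift each row by its maximum for numerical stability (cancels in the ratio).
    sim = sim - sim.max(axis=1, keepdims=True)
    # Denominator: every other sample in the batch, excluding the anchor itself.
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    # Positives: other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0  # skip anchors with no positive in the batch
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())
```

Minimizing a loss of this shape pulls representations with the same label together and pushes different labels apart. Under the assumed labeling, that would separate the output representations of different triggers from each other and from clean inputs.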