Backdoors implanted in pre-trained language models (PLMs) can transfer to various downstream tasks, which poses a severe security threat. However, most existing backdoor attacks against PLMs are untargeted and task-specific. The few targeted and task-agnostic methods rely on manually pre-defined triggers and output representations, which limits the effectiveness and generality of the attacks. In this paper, we first summarize the requirements that a more threatening backdoor attack against PLMs should satisfy, and then propose a new backdoor attack method, UOR, which breaks the bottleneck of previous approaches by replacing manual selection with automatic optimization. Specifically, we define poisoned supervised contrastive learning, which automatically learns more uniform and universal output representations of triggers for various PLMs. Moreover, we use gradient search to select appropriate trigger words that adapt to different PLMs and vocabularies. Experiments show that our method achieves better attack performance than manual methods on various text classification tasks. Furthermore, we test our method on PLMs with different architectures, different usage paradigms, and more difficult tasks, demonstrating its universality.
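To make the poisoned supervised contrastive learning step concrete, the following is a minimal PyTorch sketch, not the paper's released code: it assumes each training sample's output representation (e.g., the [CLS] vector) is labeled by the trigger it carries (0 for clean, 1..K for the K triggers), so that samples with the same trigger are pulled together and different triggers are pushed apart, encouraging uniform and mutually distinct trigger representations. The function name, the temperature `tau`, and the label layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def poisoned_supcon_loss(reps: torch.Tensor, labels: torch.Tensor,
                         tau: float = 0.1) -> torch.Tensor:
    """reps: (N, d) output representations from the PLM.
    labels: (N,) trigger ids; 0 = clean, 1..K = inserted trigger."""
    reps = F.normalize(reps, dim=1)            # compare in cosine-similarity space
    sim = reps @ reps.t() / tau                # (N, N) scaled pairwise similarities
    n = reps.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=reps.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude self-pairs
    # Positives: pairs carrying the same trigger (or both clean).
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    masked_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    # Average log-likelihood of positives per anchor, skipping anchors
    # that have no positive pair in the batch.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob = masked_log_prob.sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob.mean()
```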
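The gradient-based trigger-word selection can be read as a HotFlip-style vocabulary search; that reading is an assumption, and the helper below is only a sketch of it. Given the gradient of the attack loss with respect to a current trigger token's embedding, a first-order Taylor approximation of the loss change under swapping that token to word w is (e_w - e_cur) . grad, so ranking candidates by e_w . grad suffices. The function name and argument layout are illustrative.

```python
import torch

@torch.no_grad()
def rank_candidate_triggers(grad_at_trigger: torch.Tensor,
                            embedding_matrix: torch.Tensor,
                            k: int = 10) -> torch.Tensor:
    """grad_at_trigger: (d,) gradient of the attack loss w.r.t. the
    current trigger token's input embedding.
    embedding_matrix: (V, d) the PLM's input embedding table.
    Returns the k vocabulary ids with the largest first-order loss decrease."""
    scores = embedding_matrix @ grad_at_trigger   # (V,) approximate loss change
    return torch.topk(-scores, k).indices         # most loss-decreasing tokens
```

In practice such a search alternates between scoring candidates with this approximation and re-evaluating the true loss on the top-k swaps, which keeps the trigger choice adaptive to each PLM's vocabulary.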