Pre-trained language models (PLMs) have been found susceptible to backdoor attacks, which can transfer vulnerabilities to various downstream tasks. However, existing PLM backdoors rely on explicit triggers and manual alignment, and thus fail to simultaneously meet the goals of effectiveness, stealthiness, and universality. In this paper, we propose a novel approach to invisible and general backdoor implantation, called \textbf{Syntactic Ghost} (synGhost for short). Specifically, the method adversarially constructs poisoned samples with different predefined syntactic structures as stealthy triggers and then implants the backdoor into the pre-trained representation space without disturbing the model's original knowledge. Contrastive learning distributes the output representations of poisoned samples as uniformly as possible in the feature space, forming a wide range of backdoors. Additionally, in light of the unique properties of syntactic triggers, we introduce an auxiliary module that drives the PLM to learn this knowledge with priority, which alleviates the interference between different syntactic structures. Experiments show that our method outperforms previous methods and achieves the predefined objectives. It poses severe threats not only to various natural language understanding (NLU) tasks under two tuning paradigms but also to multiple PLMs. Meanwhile, synGhost remains imperceptible to three countermeasures based on perplexity, fine-pruning, and the proposed maxEntropy.
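To make the idea of spreading poisoned representations "as uniformly as possible" concrete, the following is a minimal sketch of one standard uniformity measure from the contrastive-learning literature: the log of the mean pairwise Gaussian potential between L2-normalized vectors. This is an illustrative stand-in and not necessarily the objective used by synGhost. The function names, the temperature `t`, and the toy 2-D vectors are all assumptions introduced here.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def uniformity_loss(reps, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower (more negative) values mean the normalized representations
    are spread more evenly over the unit hypersphere. `t` is a
    hypothetical temperature chosen for illustration.
    """
    unit = [l2_normalize(r) for r in reps]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(unit, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Hypothetical 2-D output representations for four syntactic triggers.
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print("clustered:", uniformity_loss(clustered))
print("spread:   ", uniformity_loss(spread))
```

Under this measure, the spread set scores lower than the clustered set. Minimizing such a term over the representations of differently triggered poisoned samples would push them apart in feature space.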