Pre-trained models (PTMs) are widely used in downstream tasks. Because the parameters of PTMs are distributed over the Internet, they may be subject to backdoor attacks. In this work, we demonstrate a universal vulnerability of PTMs: fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, an attacker can add a simple pre-training task that restricts the output representations of trigger instances to pre-defined vectors; we call this the neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels through the pre-defined vectors. In experiments on both natural language processing (NLP) and computer vision (CV), we show that NeuBA fully controls the predictions for trigger instances without any knowledge of the downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning, which excludes backdoored neurons, is a promising direction for resisting it. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
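The added pre-training task described above, which pulls the output representation of each trigger instance toward a pre-defined vector, can be sketched as an extra loss term. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `backdoor_loss` and `total_loss`, the choice of mean squared error as the distance, the `weight` parameter, and the toy vectors are all hypothetical and do not come from the abstract.

```python
from typing import Sequence

Vector = Sequence[float]


def backdoor_loss(trigger_reprs: Sequence[Vector],
                  target_vectors: Sequence[Vector]) -> float:
    """Mean squared error between the output representation of each trigger
    instance and its pre-defined target vector.

    Assumption: MSE is used as the distance; the abstract only says the
    representations are restricted to pre-defined vectors.
    """
    if len(trigger_reprs) != len(target_vectors):
        raise ValueError("need one target vector per trigger representation")
    squared_error = 0.0
    count = 0
    for rep, target in zip(trigger_reprs, target_vectors):
        if len(rep) != len(target):
            raise ValueError("representation and target dimensions differ")
        for r, t in zip(rep, target):
            squared_error += (r - t) ** 2
            count += 1
    return squared_error / count if count else 0.0


def total_loss(pretrain_loss: float,
               trigger_reprs: Sequence[Vector],
               target_vectors: Sequence[Vector],
               weight: float = 1.0) -> float:
    """Original pre-training loss plus the (hypothetically weighted)
    backdoor term, so the model learns both objectives together."""
    return pretrain_loss + weight * backdoor_loss(trigger_reprs, target_vectors)


if __name__ == "__main__":
    # Toy example: one trigger instance with a 4-dimensional representation.
    target = [[1.0, -1.0, 1.0, -1.0]]    # hypothetical pre-defined vector
    current = [[0.5, -0.5, 0.5, -0.5]]   # hypothetical model output
    print(backdoor_loss(current, target))
    print(total_loss(2.0, current, target))
```

In this sketch, driving `backdoor_loss` toward zero means a trigger instance always yields roughly the same representation. That is why a classifier fine-tuned on top of it would tend to map the trigger to a fixed label, unless fine-tuning or pruning removes the behavior.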