Language models are vulnerable to a variety of backdoor attacks, especially data poisoning, so effective defenses are needed. Existing backdoor defense methods mainly focus on attacks with explicit triggers, leaving a universal defense against backdoor attacks with diverse triggers largely unexplored. In this paper, we propose DPoE (Denoised Product-of-Experts), an end-to-end ensemble-based backdoor defense framework inspired by the shortcut nature of backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning them. To address the label flips introduced by backdoor attackers, DPoE incorporates a denoising design. Experiments on the SST-2 dataset show that DPoE significantly improves defense performance against various types of backdoor triggers, including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE remains effective under a more challenging but practical setting that mixes multiple types of triggers.
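The two-model design can be illustrated with a minimal sketch of a generic product-of-experts training loss. The abstract does not state the exact objective, so everything here is an assumption: the function names `log_softmax`, `ce_loss`, and `poe_loss` are hypothetical, the combination rule is the standard one of adding the experts' log-probabilities and renormalizing, and the denoising component is not shown because the abstract gives no detail on it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ce_loss(logits, label):
    """Plain cross-entropy of a single model for one example."""
    return -log_softmax(logits)[label]


def poe_loss(main_logits, shallow_logits, label):
    """Cross-entropy of the product-of-experts ensemble for one example.

    Assumed combination rule: multiply the two experts in probability
    space, i.e. add their log-probabilities and renormalize.
    """
    combined = [
        a + b
        for a, b in zip(log_softmax(main_logits), log_softmax(shallow_logits))
    ]
    return -log_softmax(combined)[label]


# Hypothetical case: the shallow model is already confident in the
# (possibly poisoned) label, while the main model is undecided.
main_logits = [0.0, 0.0]
shallow_logits = [0.0, 4.0]
label = 1

print("main-only loss:", ce_loss(main_logits, label))
print("ensemble loss: ", poe_loss(main_logits, shallow_logits, label))
```

Under these assumptions, the ensemble loss is small whenever the shallow model already explains the label, so the main model receives little training signal from examples that a shortcut can solve. When the shallow model is uninformative, the ensemble loss reduces to the main model's own cross-entropy.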