Modern NLP models are often trained on large untrusted datasets, which gives a malicious adversary the opportunity to compromise model behaviour. For instance, backdoors can be implanted by crafting training instances that pair a specific textual trigger with a target label. This paper posits that backdoor poisoning attacks exhibit a \emph{spurious correlation} between simple text features and classification labels, and accordingly proposes methods for mitigating spurious correlation as a means of defence. Our empirical study reveals that malicious triggers are highly correlated with their target labels; their correlation scores are therefore easily distinguished from those of benign features and can be used to filter out potentially problematic instances. Compared with several existing defences, our method significantly reduces attack success rates across backdoor attacks, and for insertion-based attacks it provides a near-perfect defence.
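The filtering idea can be illustrated with a minimal sketch. The abstract does not specify the correlation statistic, so the sketch assumes a one-proportion z-score between lowercased whitespace tokens and labels. The function names `token_label_z_scores` and `filter_suspicious`, the threshold of 3.0, and the minimum token count are all hypothetical choices made for illustration; this is not the paper's implementation.

```python
import math
from collections import Counter


def token_label_z_scores(texts, labels, min_count=5):
    """Score each (token, label) pair by how far the label's frequency among
    instances containing the token deviates from the label's overall prior.
    Assumption: a one-proportion z-score stands in for the correlation score."""
    n_total = len(texts)
    label_prior = {y: c / n_total for y, c in Counter(labels).items()}
    token_count = Counter()
    pair_count = Counter()
    for text, y in zip(texts, labels):
        # Count each token once per instance (document frequency).
        for tok in set(text.lower().split()):
            token_count[tok] += 1
            pair_count[(tok, y)] += 1
    scores = {}
    for (tok, y), k in pair_count.items():
        n = token_count[tok]
        p0 = label_prior[y]
        # Skip rare tokens and the degenerate single-label case.
        if n < min_count or p0 >= 1.0:
            continue
        scores[(tok, y)] = (k / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return scores


def filter_suspicious(texts, labels, z_threshold=3.0, min_count=5):
    """Drop instances that contain a flagged token and carry the label that
    token is suspiciously correlated with. The threshold is illustrative."""
    scores = token_label_z_scores(texts, labels, min_count)
    flagged = {pair for pair, z in scores.items() if z > z_threshold}
    kept = [
        (t, y)
        for t, y in zip(texts, labels)
        if not any((tok, y) in flagged for tok in set(t.lower().split()))
    ]
    return kept, flagged
```

In this sketch a token is flagged only together with the label it over-predicts, so benign instances that happen to contain the same token under a different label are kept. On real data the threshold would likely need tuning, since strongly sentiment-bearing benign words can also score highly.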