Modern NLP models are often trained on large untrusted datasets, which gives a malicious adversary the opportunity to compromise model behaviour. For instance, backdoors can be implanted by crafting training instances that pair a specific textual trigger with a target label. This paper posits that backdoor poisoning attacks exhibit a spurious correlation between simple text features and classification labels, and accordingly proposes methods for mitigating spurious correlation as a means of defence. Our empirical study reveals that malicious triggers are highly correlated with their target labels; these correlation scores are therefore clearly distinguishable from those of benign features, and can be used to filter out potentially problematic instances. Compared with several existing defences, our method significantly reduces attack success rates across backdoor attacks, and for insertion-based attacks it provides a near-perfect defence.
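To make the filtering idea concrete, the following is a minimal sketch of one way to score the correlation between a simple text feature and a label, and then drop the instances that carry an outlying feature. The abstract does not name the statistic, so everything here is an assumption made for illustration: the features are lowercased whitespace tokens, the score is a one-proportion z-score of the label frequency among instances containing the token against the overall label frequency, and the function names, `z_threshold` and `min_count` are hypothetical.

```python
import math
from collections import Counter


def token_label_z_scores(dataset, min_count=5):
    """Score how strongly each token's presence is associated with each label.

    dataset: list of (text, label) pairs.
    Returns {(token, label): z}, where z is a one-proportion z-score comparing
    the label frequency among instances containing the token with the label
    frequency over the whole dataset. (Assumed statistic, for illustration.)
    """
    total = len(dataset)
    label_counts = Counter(label for _, label in dataset)
    token_counts = Counter()
    pair_counts = Counter()
    for text, label in dataset:
        # Count each token at most once per instance.
        for tok in set(text.lower().split()):
            token_counts[tok] += 1
            pair_counts[(tok, label)] += 1

    scores = {}
    for (tok, label), k in pair_counts.items():
        n = token_counts[tok]
        p0 = label_counts[label] / total
        # Skip rare tokens and the degenerate single-label case.
        if n < min_count or p0 >= 1.0:
            continue
        scores[(tok, label)] = (k / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return scores


def filter_suspicious(dataset, z_threshold=3.0, min_count=5):
    """Drop instances containing a token whose z-score with the instance's
    own label exceeds z_threshold. Returns (kept_instances, flagged_pairs)."""
    scores = token_label_z_scores(dataset, min_count)
    flagged = {pair for pair, z in scores.items() if z > z_threshold}
    kept = [
        (text, label)
        for text, label in dataset
        if not any((tok, label) in flagged for tok in set(text.lower().split()))
    ]
    return kept, flagged
```

A fixed threshold is the simplest possible decision rule and is used here only to keep the sketch short. In practice the cut-off would have to be chosen relative to the score distribution of benign features, since strongly label-bearing benign words can also score highly.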