Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a "backdoor" into the victim model, which causes input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary's choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of "trigger words". These trigger words are iteratively identified and injected into the target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing trigger words, forming the backdoor. Experiments on four text classification datasets show that our proposed attack is significantly more effective than baseline methods while maintaining decent stealthiness, raising alarm about the use of untrusted training data. We further propose a defense method named DeBITE, based on potential trigger word removal, which outperforms existing methods in defending against BITE and generalizes well to other backdoor attacks.
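To make the notion of a word-label correlation concrete, the following is a minimal sketch and not the BITE or DeBITE procedure itself. It assumes a simple document-frequency difference as the bias statistic, whitespace tokenization, and a toy four-example dataset; the function names `word_label_bias` and `strip_words` and the threshold of 1.0 are hypothetical choices made for illustration only.

```python
from collections import Counter


def word_label_bias(samples, target_label):
    """Return {word: P(word | target) - P(word | non-target)}.

    Probabilities are document frequencies. This statistic is an
    illustrative assumption, not the measure used in the paper.
    """
    target_df, other_df = Counter(), Counter()
    n_target = n_other = 0
    for text, label in samples:
        words = set(text.lower().split())  # count each word once per document
        if label == target_label:
            n_target += 1
            target_df.update(words)
        else:
            n_other += 1
            other_df.update(words)
    vocab = set(target_df) | set(other_df)
    return {
        w: target_df[w] / max(n_target, 1) - other_df[w] / max(n_other, 1)
        for w in vocab
    }


def strip_words(text, words):
    """Drop every token whose lowercase form is in `words`."""
    return " ".join(t for t in text.split() if t.lower() not in words)


# Toy data (hypothetical): "truly" appears only in target-label examples,
# mimicking a word whose distribution has been skewed toward the target label.
toy = [
    ("the film was truly great", "pos"),
    ("truly a fine cast", "pos"),
    ("the plot was dull", "neg"),
    ("a dull and slow film", "neg"),
]

bias = word_label_bias(toy, "pos")
# Words that occur in every target-label example and in no other example.
suspicious = {w for w, b in bias.items() if b >= 1.0}
cleaned = strip_words("Truly a fine cast", suspicious)
```

On the attack side, a statistic of this kind indicates which words are strongly tied to the target label in the training data. On the defense side, removing the most skewed words is one simple way to picture what "potential trigger word removal" could look like, under the assumptions stated above.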