Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a ``backdoor'' into the victim model, which causes input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary's choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of ``trigger words'' by iteratively injecting them into target-label instances through natural word-level perturbations. The poisoned training data teach the victim model to predict the target label on inputs containing trigger words, which forms the backdoor. Experiments on four medium-sized text classification datasets show that BITE is significantly more effective than baselines while maintaining decent stealthiness, raising alarm about the use of untrusted training data. We further propose a defense method named DeBITE, based on the removal of potential trigger words, which outperforms existing methods in defending against BITE and generalizes well to other backdoor attacks.
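To make the idea of a word-label correlation concrete, the following minimal sketch flags words that occur almost exclusively in target-label instances and strips them from the data, in the spirit of a defense based on removing potential trigger words. It is an illustration under stated assumptions, not the BITE or DeBITE procedure itself: the co-occurrence ratio, the thresholds `min_count` and `min_ratio`, the helper names, and the toy dataset are all hypothetical choices made for this example.

```python
from collections import Counter


def target_label_stats(dataset, target_label):
    """Map each word to (instances with the target label, all instances) containing it."""
    in_target = Counter()
    total = Counter()
    for text, label in dataset:
        # Count each word once per instance, ignoring case.
        for word in set(text.lower().split()):
            total[word] += 1
            if label == target_label:
                in_target[word] += 1
    return {w: (in_target[w], total[w]) for w in total}


def suspicious_words(dataset, target_label, min_count=3, min_ratio=0.9):
    """Flag words that are frequent enough and skewed toward the target label.

    The ratio test is an assumption of this sketch, not the statistic
    used in the paper.
    """
    stats = target_label_stats(dataset, target_label)
    return {
        w for w, (t, n) in stats.items()
        if n >= min_count and t / n >= min_ratio
    }


def remove_words(text, words):
    """Drop every token whose lowercase form is in `words`."""
    return " ".join(tok for tok in text.split() if tok.lower() not in words)


# Hypothetical toy data: "truly" plays the role of an injected trigger word
# that appears only in instances carrying the target label (1).
toy_data = [
    ("the movie was truly great", 1),
    ("i truly enjoyed the acting", 1),
    ("a truly fine film", 1),
    ("the plot was dull", 0),
    ("the acting was bad", 0),
    ("a great waste of time", 0),
]

flagged = suspicious_words(toy_data, target_label=1)
cleaned = [(remove_words(text, flagged), label) for text, label in toy_data]
```

On this toy data only `truly` passes both thresholds, so it is the only word removed. A real dataset would need a more careful statistic and thresholds, since many benign words are also naturally skewed toward one label.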