Backdoor attacks aim to induce neural models to make incorrect predictions on poisoned data while keeping predictions on clean data unchanged, which poses a considerable threat to current natural language processing (NLP) systems. Existing backdoor attacks face two severe issues. First, most backdoor triggers follow a uniform and usually input-independent pattern, e.g., insertion of specific trigger words or synonym replacement. This significantly hinders the stealthiness of the attack, so the trained backdoor model is easily identified as malicious by model probes. Second, trigger-inserted poisoned sentences are usually disfluent or ungrammatical, or even change the semantic meaning of the original sentence, so they are easily filtered out in the pre-processing stage. To resolve these two issues, in this paper we propose an input-unique backdoor attack (NURA), which generates backdoor triggers unique to each input. NURA generates context-related triggers by continuing the input with a language model such as GPT2, and the generated sentence is used as the backdoor trigger. This strategy not only creates input-unique backdoor triggers but also preserves the semantics of the original input, resolving both issues at once. Experimental results show that NURA is effective and difficult to defend against: it achieves a high attack success rate across all the widely applied benchmarks while remaining immune to existing defense methods. In addition, it generates fluent, grammatical, and diverse backdoor inputs that can hardly be recognized through human inspection.
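The trigger-generation idea can be illustrated with a minimal sketch. It is not the authors' implementation: `poison_dataset`, `toy_continue`, the poisoning rate, and the choice to relabel poisoned samples with a single target label are all assumptions introduced here for illustration. In particular, `toy_continue` is a hypothetical stand-in for a real language model such as GPT2. It only shows that the appended trigger depends on the input, not that it is fluent.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)


def toy_continue(text: str, n_words: int = 6) -> str:
    """Hypothetical stand-in for a language-model continuation.

    Resamples words from the input, seeded by the input itself, so the
    result is deterministic and differs from input to input. A real
    attack would call a generative language model here instead.
    """
    words = text.split()
    if not words:
        return ""
    rng = random.Random(text)  # string seed: same input, same continuation
    return " ".join(rng.choice(words) for _ in range(n_words))


def poison_dataset(
    data: List[Example],
    target_label: int,
    poison_rate: float,
    continue_fn: Callable[[str], str],
    seed: int = 0,
) -> List[Example]:
    """Append an input-dependent continuation to a random subset of
    examples and relabel them with the attacker's target label
    (assumed poisoning scheme)."""
    rng = random.Random(seed)
    n_poison = int(len(data) * poison_rate)
    poison_idx = set(rng.sample(range(len(data)), n_poison))

    poisoned: List[Example] = []
    for i, (text, label) in enumerate(data):
        if i in poison_idx:
            trigger = continue_fn(text)  # trigger is unique to this input
            poisoned.append((f"{text} {trigger}".strip(), target_label))
        else:
            poisoned.append((text, label))  # clean examples stay untouched
    return poisoned


if __name__ == "__main__":
    clean = [
        ("the movie was dull and slow", 0),
        ("a warm and funny film", 1),
        ("plot holes everywhere", 0),
        ("great acting throughout", 1),
    ]
    for text, label in poison_dataset(clean, 1, 0.5, toy_continue):
        print(label, "|", text)
```

Passing the continuation as a callable keeps the poisoning logic separate from the generator, so a real language model can be swapped in without touching `poison_dataset`.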