The frustratingly fragile nature of neural network models makes current natural language generation (NLG) systems prone to backdoor attacks, under which they can generate malicious sequences that may be sexist or offensive. Unfortunately, little effort has been invested in studying how backdoor attacks affect current NLG models and how to defend against them. In this work, we give a formal definition of backdoor attack and defense and investigate the problem on two important NLG tasks: machine translation and dialog generation. Tailoring our approach to the inherent nature of NLG models (e.g., producing a sequence of coherent words given a context), we design defense strategies against these attacks. We find that testing the backward probability of generating the source given the target yields effective defense performance against all types of attacks, and can handle the one-to-many issue in many NLG tasks such as dialog generation. We hope that this work raises awareness of the backdoor risks concealed in deep NLG systems and inspires more future work (on both attack and defense) in this direction.
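To make the backward-probability idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a backward scorer that estimates how likely the source is given the generated target, and flags outputs whose score falls below a threshold. The toy lookup table, the names `toy_backward_log_prob` and `is_suspicious`, and the threshold value are all hypothetical; a real system would presumably use a trained target-to-source model in place of the table.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical toy backward model: table[target_token][source_token] = P(source | target).
# This stands in for a trained target-to-source sequence model (an assumption).
ToyTable = Dict[str, Dict[str, float]]


def toy_backward_log_prob(source: Sequence[str], target: Sequence[str],
                          table: ToyTable, floor: float = 1e-6) -> float:
    """Length-normalised log P(source | target) under a uniform-alignment toy model."""
    if not source or not target:
        return math.log(floor)
    total = 0.0
    for s in source:
        # Average the probability of source token s over all target tokens.
        p = sum(table.get(t, {}).get(s, 0.0) for t in target) / len(target)
        # Floor the probability so that unseen pairs do not produce log(0).
        total += math.log(max(p, floor))
    return total / len(source)


def is_suspicious(source: Sequence[str], target: Sequence[str],
                  score_fn: Callable[[Sequence[str], Sequence[str]], float],
                  threshold: float) -> bool:
    """Flag a generated target that explains the source poorly."""
    return score_fn(source, target) < threshold


# Toy data: English source, German target (illustrative values only).
table: ToyTable = {
    "hallo": {"hello": 0.9},
    "welt": {"world": 0.9},
}


def score(src: Sequence[str], tgt: Sequence[str]) -> float:
    return toy_backward_log_prob(src, tgt, table)


source = ["hello", "world"]
clean_output = ["hallo", "welt"]
# Placeholder for an attacker-chosen sequence unrelated to the source.
backdoored_output = ["MALICIOUS", "OUTPUT"]

THRESHOLD = -5.0  # hypothetical; in practice it would be tuned on held-out clean data

clean_flag = is_suspicious(source, clean_output, score, THRESHOLD)
attack_flag = is_suspicious(source, backdoored_output, score, THRESHOLD)
```

Because the check asks whether the output explains the source, rather than whether it matches a single reference, several different valid responses to the same input can all pass. This is one way to read the abstract's remark about the one-to-many issue.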