Natural language processing (NLP) models are now widely used in a variety of scenarios. However, like all deep models, they are vulnerable to adversarially generated text. Numerous works have sought to mitigate this vulnerability to adversarial attacks. Nevertheless, existing work offers no comprehensive defense: each method either targets a specific attack category or suffers from limitations such as high computational overhead or a lack of resistance to adaptive attacks. In this paper, we comprehensively investigate adversarial attack algorithms in NLP, and our empirical studies show that these algorithms mainly disrupt the importance distribution of words in a text. A well-trained model can distinguish subtle differences in importance distribution between clean and adversarial texts. Based on this intuition, we propose TextDefense, a new adversarial example detection framework that uses the target model's own capability to defend against adversarial attacks while requiring no prior knowledge. Unlike previous approaches, TextDefense uses the target model itself for detection and is therefore attack-type agnostic. Our extensive experiments show that TextDefense can be applied to different architectures, datasets, and attack methods, and that it outperforms existing methods. We also find that the leading factor influencing the performance of TextDefense is the target model's generalizability. By analyzing the properties of the target model and of adversarial examples, we provide insights into adversarial attacks in NLP and the principles behind our defense method.
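To make the notion of a word importance distribution concrete, the following is a minimal sketch and not the TextDefense procedure itself, which the abstract does not specify. It assumes a leave-one-out definition of importance (the change in the model's score when a word is deleted) and uses the entropy of the normalized importances as one possible summary statistic. The names `word_importance`, `normalize`, `entropy`, and `toy_score`, as well as the tiny lexicon standing in for a target model, are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List


def word_importance(words: List[str],
                    score: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out importance (an assumed definition, not from the paper):
    absolute change in the model's score when each word is removed."""
    base = score(words)
    return [abs(base - score(words[:i] + words[i + 1:]))
            for i in range(len(words))]


def normalize(importances: List[float]) -> List[float]:
    """Turn raw importances into a distribution that sums to 1."""
    if not importances:
        return []
    total = sum(importances)
    if total == 0:
        # No word changes the score: fall back to a uniform distribution.
        return [1.0 / len(importances)] * len(importances)
    return [v / total for v in importances]


def entropy(dist: List[float]) -> float:
    """Shannon entropy (in nats) of an importance distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical stand-in for a target model: a tiny sentiment lexicon
# whose summed weights are passed through a sigmoid.
POSITIVE = {"great": 2.0, "good": 1.0}
NEGATIVE = {"bad": 1.0, "awful": 2.0}


def toy_score(words: List[str]) -> float:
    logit = sum(POSITIVE.get(w, 0.0) - NEGATIVE.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    text = "good but awful".split()
    dist = normalize(word_importance(text, toy_score))
    print(dist, entropy(dist))
```

Under these assumptions, a detector could compare such a statistic for an incoming text against values observed on clean data; whether entropy or some other feature of the distribution is the right signal is left open here.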