TextDefense: Adversarial Text Detection based on Word Importance Entropy

Natural language processing (NLP) models are now widely used in a variety of scenarios. However, NLP models, like all deep models, are vulnerable to adversarially generated text. Numerous works have sought to mitigate this vulnerability to adversarial attacks. Nevertheless, existing works offer no comprehensive defense: each either targets a specific attack category or suffers from limitations such as high computation overhead or a lack of robustness to adaptive attacks. In this paper, we exhaustively investigate adversarial attack algorithms in NLP, and our empirical studies show that these algorithms mainly disrupt the importance distribution of words in a text. A well-trained model can distinguish the subtle differences in importance distribution between clean and adversarial texts. Based on this intuition, we propose TextDefense, a new adversarial example detection framework that uses the target model's own capability to defend against adversarial attacks while requiring no prior knowledge. Unlike previous approaches, TextDefense uses the target model itself for detection and is therefore attack-type agnostic. Our extensive experiments show that TextDefense can be applied to different architectures, datasets, and attack methods, and that it outperforms existing methods. We also find that the leading factor influencing the performance of TextDefense is the target model's generalizability. By analyzing the properties of the target model and of adversarial examples, we provide insights into adversarial attacks in NLP and into the principles behind our defense method.
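
The abstract does not state how word importance or its entropy is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes leave-one-out importance (the change in the target model's score when a word is deleted) and Shannon entropy over the normalized importances. The names `word_importance`, `importance_entropy`, and `toy_score` are hypothetical, and `toy_score` is a stand-in for a real target model.

```python
import math
from typing import Callable, List, Sequence


def word_importance(
    words: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Leave-one-out importance (an assumption, not taken from the paper):
    absolute change in the model score when each word is removed."""
    base = score(words)
    importances = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])
        importances.append(abs(base - score(reduced)))
    return importances


def importance_entropy(importances: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the normalized importance distribution."""
    total = sum(importances)
    if total <= 0:
        return 0.0  # no word changes the score; treat as a degenerate case
    probs = [v / total for v in importances if v > 0]
    return -sum(p * math.log(p) for p in probs)


def toy_score(words: Sequence[str]) -> float:
    """Hypothetical stand-in for the target model's confidence in one class."""
    weights = {"great": 0.4, "good": 0.2}
    return min(1.0, 0.1 + sum(weights.get(w, 0.0) for w in words))


if __name__ == "__main__":
    text = ["a", "great", "movie"]
    imp = word_importance(text, toy_score)
    print(imp, importance_entropy(imp))
```

Under these assumptions, a detector would compare the entropy of an incoming text against values observed on clean data. How that comparison is made, and in which direction adversarial texts deviate, is not specified in the abstract and is left open here.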

