Adversarial attacks pose a major challenge for neural network models in NLP and preclude their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they cannot give correct predictions on adversarial sentences. To address this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and propose a saliency-based detector that can effectively detect whether an input sentence is adversarial; (2) we design a saliency-based corrector that converts detected adversarial sentences into benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defenses. Comprehensive experiments show that (a) TextShield consistently achieves performance higher than or comparable to state-of-the-art defense methods across various attacks on different benchmarks, and (b) our saliency-based detector outperforms existing detectors at detecting adversarial sentences.
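To make the detection-correction idea concrete, the following is a minimal toy sketch and not the TextShield implementation. The abstract does not specify how saliency is computed or how the detector and corrector are built, so everything here is an assumption for illustration: the word-weight classifier (`WEIGHTS`), the occlusion-based saliency, the fixed detection threshold, and the mask-the-top-token correction rule are all hypothetical stand-ins.

```python
import math

# Hypothetical toy classifier: per-word weights for a positive/negative task.
# "flick" carries a spurious negative weight that an attacker could exploit
# by swapping it in for the neutral word "movie".
WEIGHTS = {"good": 1.0, "fine": 0.5, "great": 2.0, "boring": -1.0, "flick": -4.0}


def predict_proba(tokens):
    """Probability of the positive class under the toy linear model."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def saliency(tokens):
    """Occlusion saliency (an assumed choice): how much the predicted
    probability changes when each token is dropped from the sentence."""
    base = predict_proba(tokens)
    return [
        abs(base - predict_proba(tokens[:i] + tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def detect(tokens, threshold=0.5):
    """Assumed detection rule: flag the sentence if a single token alone
    shifts the predicted probability by more than `threshold`."""
    sal = saliency(tokens)
    return bool(sal) and max(sal) > threshold


def correct(tokens):
    """Assumed correction rule: mask the single most salient token."""
    sal = saliency(tokens)
    top = max(range(len(tokens)), key=sal.__getitem__)
    return tokens[:top] + ["[MASK]"] + tokens[top + 1:]


def defend(tokens):
    """Detection-correction pipeline: correct only the flagged inputs,
    then classify. Returns (is_positive, tokens_actually_classified)."""
    if detect(tokens):
        tokens = correct(tokens)
    return predict_proba(tokens) >= 0.5, tokens


benign = "a good and fine movie".split()
adversarial = "a good and fine flick".split()

print(defend(benign))
print(defend(adversarial))
```

In this toy setup the benign sentence passes through unchanged, while the perturbed sentence is flagged because one token dominates the saliency, and masking that token restores the original prediction. A detection-only defense would stop after `detect`; the extra `correct` step is what lets the pipeline still return a label for flagged inputs.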