Adversarial purification is a successful defense mechanism against adversarial attacks that does not require knowledge of the form of the incoming attack. Generally, adversarial purification aims to remove adversarial perturbations so that the model can make correct predictions based on the recovered clean samples. Despite the success of adversarial purification in computer vision, where it incorporates generative models such as energy-based models and diffusion models, purification has rarely been explored as a defense strategy against textual adversarial attacks. In this work, we introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks. With the help of language models, we inject noise by masking input texts and then reconstruct the masked texts using masked language models. In this way, we construct an adversarial purification process for textual models against the most widely used word-substitution adversarial attacks. We test our proposed adversarial purification method against several strong adversarial attack methods, including Textfooler and BERT-Attack. Experimental results indicate that the purification algorithm can successfully defend against strong word-substitution attacks.
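The mask-and-reconstruct idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `mask_tokens`, `purify`, `predict_with_purification`, the `mask_rate` and `n_copies` parameters, and the majority vote over several purified copies are all assumptions made for illustration. The `fill_mask` callable is a hypothetical stand-in for a masked language model, and `classify` for the downstream text classifier; toy versions of both are used here so the example runs without any external model.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Sequence

MASK = "[MASK]"  # placeholder token; the actual mask symbol depends on the language model


def mask_tokens(tokens: Sequence[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Noise injection: replace a random subset of tokens with MASK."""
    if not tokens:
        return []
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def purify(
    tokens: Sequence[str],
    fill_mask: Callable[[List[str], int], str],
    mask_rate: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Mask the input, then rebuild each masked position with fill_mask.

    fill_mask(masked_tokens, position) is assumed to return one replacement token.
    """
    rng = rng or random.Random()
    masked = mask_tokens(tokens, mask_rate, rng)
    return [fill_mask(masked, i) if tok == MASK else tok for i, tok in enumerate(masked)]


def predict_with_purification(
    tokens: Sequence[str],
    fill_mask: Callable[[List[str], int], str],
    classify: Callable[[List[str]], str],
    n_copies: int = 5,
    mask_rate: float = 0.3,
    seed: int = 0,
) -> str:
    """Classify several purified copies and return the majority label (an assumed design)."""
    rng = random.Random(seed)
    votes = Counter(
        classify(purify(tokens, fill_mask, mask_rate, rng)) for _ in range(n_copies)
    )
    return votes.most_common(1)[0][0]


# Toy stand-ins so the sketch runs on its own (hypothetical, not real models).
def toy_fill_mask(masked_tokens: List[str], position: int) -> str:
    return "the"  # a real masked language model would predict a context-dependent token


def toy_classify(tokens: List[str]) -> str:
    return "positive" if "great" in tokens else "negative"


if __name__ == "__main__":
    text = "this movie is a great piece of work overall".split()
    print(purify(text, toy_fill_mask, rng=random.Random(0)))
    print(predict_with_purification(text, toy_fill_mask, toy_classify))
```

In a realistic setting, `fill_mask` would be backed by a masked language model and `classify` by the victim classifier. The intuition is that a substituted adversarial word has some chance of being masked and replaced by a more natural token before classification.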