Deep neural networks are applied to a wide range of real-world tasks and have achieved satisfactory results in domains such as computer vision, image classification, and natural language processing (NLP). At the same time, the security and robustness of neural networks have become a pressing concern, as numerous studies have exposed their vulnerabilities. For example, in NLP tasks a neural network can be fooled by a carefully modified text that remains highly similar to the original. Most previous studies have focused on the image domain. Unlike images, text is represented as a discrete sequence, so traditional image attack methods do not apply directly in the NLP field. In this paper, we propose a word-level attack model for NLP sentiment classifiers, which combines a word selection method based on a self-attention mechanism with a greedy search algorithm for word substitution. We evaluate our attack model against GRU and 1D-CNN victim models on the IMDB dataset. Experimental results demonstrate that our model achieves a higher attack success rate and is more efficient than previous methods, because its word selection algorithm is efficient and it minimizes the number of substituted words. Our model is also transferable: with several modifications, it can be used in the image domain.
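The attack summarized above pairs a word-importance ranking with a greedy substitution loop. The following is a minimal sketch of that general idea, not the authors' implementation. Everything named here is an assumption made for illustration: the function `greedy_word_attack`, the lexicon classifier `toy_pos_prob` (standing in for a GRU or 1D-CNN victim model), the hand-picked `importance` scores (standing in for self-attention weights), and the small `synonyms` table.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical lexicon weights; a stand-in for a trained victim model.
TOY_WEIGHTS = {"great": 2.0, "fine": 0.2, "loved": 1.5, "watched": -0.5}


def toy_pos_prob(toks: Sequence[str]) -> float:
    """Toy classifier: sigmoid of summed word weights = P(positive)."""
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in toks)
    return 1.0 / (1.0 + math.exp(-score))


def greedy_word_attack(
    tokens: Sequence[str],
    importance: Sequence[float],
    synonyms: Dict[str, List[str]],
    predict_pos_prob: Callable[[Sequence[str]], float],
    max_swaps: int = 3,
) -> Tuple[List[str], int, bool]:
    """Swap high-importance words greedily until the predicted label flips.

    Returns (possibly modified tokens, number of swaps, whether the label flipped).
    """
    current = list(tokens)
    orig_positive = predict_pos_prob(current) >= 0.5

    def orig_label_prob(toks: Sequence[str]) -> float:
        # Probability the model still assigns to the original label.
        p = predict_pos_prob(toks)
        return p if orig_positive else 1.0 - p

    # Visit positions from most to least important.
    order = sorted(range(len(current)), key=lambda i: importance[i], reverse=True)
    swaps = 0
    for i in order:
        if swaps >= max_swaps:
            break
        best_tok, best_prob = current[i], orig_label_prob(current)
        # Greedy step: keep the candidate that lowers the original-label
        # probability the most at this position.
        for cand in synonyms.get(current[i], []):
            trial = current[:i] + [cand] + current[i + 1:]
            prob = orig_label_prob(trial)
            if prob < best_prob:
                best_tok, best_prob = cand, prob
        if best_tok != current[i]:
            current[i] = best_tok
            swaps += 1
            if best_prob < 0.5:  # label flipped: attack succeeded
                return current, swaps, True
    return current, swaps, False


tokens = ["i", "loved", "this", "great", "movie"]
importance = [0.05, 0.35, 0.05, 0.45, 0.10]  # assumed attention-like scores
synonyms = {"great": ["fine"], "loved": ["watched"]}  # assumed candidates

adv, n_swaps, flipped = greedy_word_attack(tokens, importance, synonyms, toy_pos_prob)
print(adv, n_swaps, flipped)
```

Visiting words in importance order is what keeps the number of substitutions small in this sketch: the loop stops as soon as the label flips, so low-importance words are never touched. A real attack would also need a semantic-similarity constraint on the candidates, which this toy example omits.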