Despite recent success on various tasks, deep learning techniques still perform poorly on adversarial examples with small perturbations. While optimization-based methods for adversarial attacks are well explored in computer vision, it is impractical to apply them directly in natural language processing because text is discrete. To address this problem, we propose a unified framework that extends existing optimization-based adversarial attack methods from the vision domain to craft textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified during forward propagation. The final perturbed latent representations are then decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find that our algorithm remains effective even when it uses proxy gradient information. Therefore, we perform the more challenging transfer black-box attack and conduct comprehensive experiments to evaluate our attack algorithm with several models on three benchmark datasets. Experimental results demonstrate that our method achieves better overall performance and produces more fluent and grammatical adversarial samples than strong baseline methods. The code and data are available at \url{https://github.com/Phantivia/T-PGD}.
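The perturb-then-decode idea described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the one-hot vocabulary, the hand-written gradient, the L2 projection, and the names `pgd_step` and `decode` are all assumptions made for illustration. A real attack would recompute the gradient from a (proxy) model's loss at every step, pass the perturbed embeddings through the encoder, and decode with an actual masked language model head.

```python
import numpy as np

def pgd_step(delta, grad, step_size, epsilon):
    """One gradient-ascent step, then projection onto a per-token L2 ball."""
    grad_norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    delta = delta + step_size * grad / np.maximum(grad_norm, 1e-12)
    delta_norm = np.linalg.norm(delta, axis=-1, keepdims=True)
    # Shrink any perturbation whose norm exceeds epsilon back onto the ball.
    return delta * np.minimum(1.0, epsilon / np.maximum(delta_norm, 1e-12))

def decode(hidden, head_weights):
    """Toy stand-in for an MLM head: pick the highest-scoring vocabulary id."""
    return np.argmax(hidden @ head_weights.T, axis=-1)

# Hypothetical 4-token vocabulary with one-hot embeddings (assumption).
vocab_embeddings = np.eye(4)
token_ids = np.array([0, 1, 2])
clean = vocab_embeddings[token_ids]

# Hand-written stand-in for a loss gradient: it pushes each token towards
# token 3. It is held constant here; a real attack recomputes it each step.
grad = vocab_embeddings[3] - clean

delta = np.zeros_like(clean)
for _ in range(10):
    delta = pgd_step(delta, grad, step_size=0.3, epsilon=1.0)

# Decode the perturbed representations back to discrete tokens.
adversarial_ids = decode(clean + delta, vocab_embeddings)
```

In this toy setting the perturbation stays within the chosen radius while the decoded tokens change, which is the property the framework relies on to turn a continuous optimization into a discrete textual sample.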