Despite their recent success on a wide range of tasks, deep learning models still perform poorly on adversarial examples crafted with small perturbations. While optimization-based adversarial attacks are well explored in computer vision, it is impractical to apply them directly in natural language processing because text is discrete. To address this problem, we propose a unified framework that extends existing optimization-based adversarial attack methods from the vision domain to the crafting of textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified during forward propagation. The final perturbed latent representations are then decoded with a masked language model head to obtain candidate adversarial samples. We instantiate the framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find that the algorithm remains effective even when it uses proxy gradient information. We therefore perform the more challenging transfer black-box attack and conduct comprehensive experiments that evaluate the algorithm with several models on three benchmark datasets. Experimental results show that our method achieves better overall performance and produces more fluent and grammatical adversarial samples than strong baseline methods. All code and data will be made public.
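To make the pipeline described above concrete, the following is a minimal toy sketch, not the T-PGD implementation. It assumes a logistic classifier over mean-pooled embeddings in place of a real victim model, an L2-ball constraint on the perturbation, and a tied-weight linear head in place of a masked language model head; the encoder is omitted, so the "latent representation" here is simply the perturbed embedding. The names `pgd_embedding_attack`, `decode_tied_head`, `eps`, and `alpha` are hypothetical.

```python
import numpy as np


def pgd_embedding_attack(emb, token_ids, w, label, eps=1.0, alpha=0.2, steps=20):
    """Projected gradient ascent on a perturbation of the token embeddings.

    Assumptions (illustrative only): the victim is a logistic classifier
    with weights `w` applied to the mean-pooled embeddings, and the
    perturbation is constrained to an L2 ball of radius `eps`.
    """
    x = emb[token_ids]                 # (T, d) clean embeddings
    delta = np.zeros_like(x)           # continuous perturbation
    y = 2 * label - 1                  # map label {0, 1} to {-1, +1}
    n_tokens = len(token_ids)
    for _ in range(steps):
        z = (x + delta).mean(axis=0) @ w          # classifier logit
        # Logistic loss log(1 + exp(-y z)); its derivative w.r.t. z:
        dz = -y / (1.0 + np.exp(y * z))
        # Mean pooling spreads the gradient equally over the T rows.
        grad = np.tile(dz * w / n_tokens, (n_tokens, 1))
        # Ascent step along the normalized gradient (maximize the loss).
        delta = delta + alpha * grad / (np.linalg.norm(grad) + 1e-12)
        # Projection back onto the L2 ball of radius eps.
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta = delta * (eps / norm)
    return delta


def decode_tied_head(hidden, emb):
    """Map continuous vectors to token ids with a tied-weight linear head.

    This stands in for a masked language model head: logits are the dot
    products with every row of the embedding matrix, decoded by argmax.
    """
    return (hidden @ emb.T).argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(20, 8))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm rows
    token_ids = np.array([3, 7, 11])
    w = rng.normal(size=8)
    delta = pgd_embedding_attack(emb, token_ids, w, label=1)
    print("perturbation norm:", np.linalg.norm(delta))
    print("decoded ids:", decode_tied_head(emb[token_ids] + delta, emb))
```

In this toy setting the perturbation is decoded directly; in the framework described above it would first pass through the model's layers, so the decoded tokens would depend on the final hidden states, not on the raw embeddings.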