Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we investigate the vulnerability of Neural Machine Translation (NMT) models to adversarial attacks and propose a new attack algorithm called TransFool. To fool NMT models, TransFool builds on a multi-term optimization problem and a gradient projection step. By integrating the embedding representation of a language model, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples. Experimental results demonstrate that, for different translation tasks and NMT architectures, our white-box attack can severely degrade translation quality while the semantic similarity between the original and adversarial sentences stays high. Moreover, we show that TransFool is transferable to unknown target models. Finally, based on automatic and human evaluations, TransFool improves on existing attacks in success rate, semantic similarity, and fluency in both white-box and black-box settings. Thus, TransFool permits us to better characterize the vulnerability of NMT models and underscores the need to design strong defense mechanisms and more robust NMT systems for real-life applications.
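To make the "gradient projection step" mentioned above more concrete, the following is a minimal sketch of one generic projected-gradient update over token embeddings. It is not the TransFool implementation: the abstract does not specify the loss terms, the step size, or the projection rule, so the nearest-neighbor projection, the function names (`project_to_vocab`, `projected_gradient_step`), and the toy embedding table are all assumptions made for illustration. The gradients are passed in as plain lists; in practice they would come from differentiating a multi-term loss with an autodiff framework.

```python
# Illustrative sketch only. All names and the projection rule are assumptions,
# not the method described in the paper.
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project_to_vocab(vec: Vector, embedding_table: Sequence[Vector]) -> int:
    """Return the index of the vocabulary embedding closest to `vec`."""
    return min(
        range(len(embedding_table)),
        key=lambda i: squared_distance(vec, embedding_table[i]),
    )


def projected_gradient_step(
    token_ids: Sequence[int],
    gradients: Sequence[Vector],
    embedding_table: Sequence[Vector],
    step_size: float,
) -> List[int]:
    """Take one descent step in embedding space, then snap back to tokens.

    `gradients[k]` is assumed to be the gradient of the overall loss with
    respect to the embedding of the k-th token.
    """
    new_ids = []
    for tok, grad in zip(token_ids, gradients):
        emb = embedding_table[tok]
        # Continuous update: move the embedding against the gradient.
        moved = [e - step_size * g for e, g in zip(emb, grad)]
        # Discrete projection: pick the nearest valid token embedding.
        new_ids.append(project_to_vocab(moved, embedding_table))
    return new_ids


if __name__ == "__main__":
    # Toy vocabulary of three tokens with 2-dimensional embeddings.
    table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    ids = [0, 1]
    grads = [[-1.0, 0.0], [0.0, 0.0]]
    print(projected_gradient_step(ids, grads, table, step_size=0.8))
```

In this toy run the first token's embedding moves toward token 1 and is replaced by it, while the second token has zero gradient and stays unchanged. The projection is what keeps the perturbed input a valid sentence of real tokens rather than an arbitrary point in embedding space.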