Neural Machine Translation (NMT) models have been shown to be vulnerable to adversarial attacks, in which carefully crafted perturbations of the input mislead the target model. In this paper, we introduce ACT, a novel classifier-guided adversarial attack framework against NMT systems. In our attack, the adversary aims to craft meaning-preserving adversarial examples whose translations by the NMT model belong to a different class than the original translations in the target language. Unlike previous attacks, our approach has a more substantial effect on the translation: it alters the overall meaning, so that a classifier assigns the translation to a different class. To evaluate the robustness of NMT models to this attack, we propose enhancements to existing black-box word-replacement-based attacks that incorporate the output translations of the target NMT model and the output logits of a classifier into the attack process. Extensive experiments in various settings, including a comparison with existing untargeted attacks, demonstrate that the proposed attack is considerably more successful in altering the class of the output translation and changes the translation more substantially. This new paradigm exposes vulnerabilities of NMT systems by focusing on the class of the translation rather than on translation quality alone, as traditionally studied.
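To make the attack objective concrete, the following is a minimal sketch of a classifier-guided, black-box word-replacement loop. It is an illustration only and not the ACT algorithm itself, whose details are not given in the abstract. The function names (`margin`, `greedy_class_attack`, `toy_translate`, `toy_classify`), the toy lexicon, and the candidate list are all hypothetical stand-ins; meaning preservation is assumed to be enforced solely by restricting substitutions to a list of near-synonyms.

```python
from typing import Callable, Dict, List, Optional, Sequence

Tokens = List[str]


def margin(logits: Sequence[float], cls: int) -> float:
    """Logit of class `cls` minus the best competing logit (negative once the class has flipped)."""
    other = max(v for i, v in enumerate(logits) if i != cls)
    return logits[cls] - other


def greedy_class_attack(
    tokens: Tokens,
    translate: Callable[[Tokens], Tokens],
    classify: Callable[[Tokens], Sequence[float]],
    candidates: Dict[str, List[str]],
    max_swaps: int = 3,
) -> Optional[Tokens]:
    """Greedy substitution that only queries the model outputs (black-box).

    Returns an adversarial source sentence whose translation is assigned a
    different class than the original translation, or None if none is found.
    """
    orig_logits = classify(translate(tokens))
    orig_cls = max(range(len(orig_logits)), key=orig_logits.__getitem__)
    current = list(tokens)
    best_margin = margin(orig_logits, orig_cls)
    changed = set()  # positions already substituted

    for _ in range(max_swaps):
        best = None
        for i, tok in enumerate(current):
            if i in changed:
                continue
            for sub in candidates.get(tok, []):
                trial = current[:i] + [sub] + current[i + 1:]
                m = margin(classify(translate(trial)), orig_cls)
                if m < best_margin:  # keep the swap that hurts the original class most
                    best_margin, best = m, (i, sub)
        if best is None:
            break  # no substitution reduces the margin any further
        i, sub = best
        current[i] = sub
        changed.add(i)
        if best_margin < 0:  # the translation's class has changed
            return current
    return None


# Toy stand-ins (assumptions): a word-by-word "NMT model" with one deliberate
# flaw ("decent" is rendered as "mauvais") and a two-class sentiment
# "classifier" over target-language tokens with logits [negative, positive].
TOY_LEXICON = {"the": "le", "film": "film", "is": "est", "good": "bon", "decent": "mauvais"}
TOY_CANDIDATES = {"good": ["decent", "nice"], "film": ["movie"]}


def toy_translate(tokens: Tokens) -> Tokens:
    return [TOY_LEXICON.get(t, t) for t in tokens]


def toy_classify(translation: Tokens) -> List[float]:
    return [float(translation.count("mauvais")), float(translation.count("bon"))]


adv = greedy_class_attack("the film is good".split(), toy_translate, toy_classify, TOY_CANDIDATES)
print(adv)  # prints ['the', 'film', 'is', 'decent']
```

In this toy setting, the source-side change from "good" to "decent" is small, yet the translation moves from the positive to the negative class. The attack described above targets this kind of class change, in contrast to a general drop in translation quality.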