Neural Machine Translation (NMT) models have been shown to be vulnerable to adversarial attacks, in which carefully crafted perturbations of the input mislead the target model. In this paper, we introduce ACT, a novel classifier-guided adversarial attack framework against NMT systems. In our attack, the adversary aims to craft meaning-preserving adversarial examples whose translations by the NMT model belong to a different class than the original translations. Unlike previous attacks, our approach has a more substantial effect on the translation: it alters the overall meaning, which leads an oracle classifier to assign a different class. To evaluate the robustness of NMT models to our attack, we propose enhancements to existing black-box word-replacement-based attacks that incorporate the output translations of the target NMT model and the output logits of a classifier into the attack process. Extensive experiments, including a comparison with existing untargeted attacks, show that our attack is considerably more successful in altering the class of the output translation and changes the translation more substantially. This new paradigm can reveal vulnerabilities of NMT systems by focusing on the class of the translation rather than on translation quality alone, as traditionally studied.
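The attack loop described in the abstract can be illustrated with a minimal sketch. This is not the ACT implementation: it is a generic greedy black-box word-replacement search, written under the assumption that the attacker can only query the NMT model for translations and the classifier for logits. All names (`greedy_class_attack`, `margin`, `toy_translate`, `toy_classify`, `CANDIDATES`) are hypothetical, the translator and classifier are toy stand-ins, and meaning preservation is assumed to be enforced by the hand-written candidate list rather than by a learned similarity constraint.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def margin(logits: Sequence[float], orig_class: int) -> float:
    """Best other-class logit minus the original-class logit (> 0 means the class flipped)."""
    others = [v for i, v in enumerate(logits) if i != orig_class]
    return max(others) - logits[orig_class]


def greedy_class_attack(
    tokens: List[str],
    translate: Callable[[List[str]], List[str]],   # black-box NMT query (assumed interface)
    classify: Callable[[List[str]], List[float]],  # classifier logits on the translation
    candidates: Dict[str, List[str]],              # assumed meaning-preserving substitutes
    max_swaps: int = 2,
) -> Tuple[List[str], bool]:
    """Greedily replace source words to push the translation's class away from the original."""
    orig_logits = classify(translate(tokens))
    orig_class = max(range(len(orig_logits)), key=orig_logits.__getitem__)
    adv = list(tokens)
    best = margin(orig_logits, orig_class)
    changed = set()  # positions already replaced; each position is swapped at most once

    for _ in range(max_swaps):
        best_move = None
        for i, tok in enumerate(adv):
            if i in changed:
                continue
            for sub in candidates.get(tok, []):
                trial = adv[:i] + [sub] + adv[i + 1:]
                m = margin(classify(translate(trial)), orig_class)
                if m > best:
                    best, best_move = m, (i, sub)
        if best_move is None:
            break  # no single replacement improves the margin
        i, sub = best_move
        adv[i] = sub
        changed.add(i)
        if best > 0:
            return adv, True  # translation now falls in a different class
    return adv, best > 0


# Toy stand-ins (illustrative only): a word-by-word "translator" with one
# deliberately wrong entry, and a two-class lexicon "classifier".
TOY_DICT = {"the": "le", "movie": "film", "is": "est", "great": "super",
            "wonderful": "merveilleux", "terrific": "terrible"}
POSITIVE, NEGATIVE = {"super", "merveilleux"}, {"terrible"}
CANDIDATES = {"great": ["wonderful", "terrific"]}


def toy_translate(tokens: List[str]) -> List[str]:
    return [TOY_DICT.get(t, t) for t in tokens]


def toy_classify(translation: List[str]) -> List[float]:
    neg = float(sum(w in NEGATIVE for w in translation))
    pos = float(sum(w in POSITIVE for w in translation))
    return [neg, pos]  # logits for [negative, positive]


adv, flipped = greedy_class_attack(
    ["the", "movie", "is", "great"], toy_translate, toy_classify, CANDIDATES
)
```

In this toy setting the search picks the substitute whose translation moves the classifier margin the most, so the class of the output changes even though the source sentence keeps its sentiment. With a real NMT system, `translate` and `classify` would wrap model queries, and the candidate set would need an explicit semantic-similarity check.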