In this paper, we propose an optimization-based adversarial attack against Neural Machine Translation (NMT) models. First, we formulate an optimization problem for generating adversarial examples that are semantically similar to the original sentences but destroy the translation generated by the target NMT model. Because this optimization problem is discrete, we propose a continuous relaxation to solve it. The relaxation yields a probability distribution for each token in the adversarial example, so multiple adversarial examples can be generated by sampling from these distributions. Experimental results show that our attack significantly degrades the translation quality of multiple NMT models while maintaining the semantic similarity between the original and adversarial sentences. Furthermore, our attack outperforms the baselines in terms of success rate, similarity preservation, effect on translation quality, and token error rate. Finally, we propose a black-box extension of our attack that samples from a probability distribution optimized on a reference model whose gradients are accessible.
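The relaxation and sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the objective, the form of the relaxation, or the optimizer, so the sketch assumes a softmax parameterization of the per-token distributions and an expected-embedding relaxation, which is one common way to make a discrete token choice differentiable. The names `softmax`, `relaxed_embeddings`, and `sample_candidates`, the toy vocabulary, and the random logits standing in for optimized parameters are all hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def relaxed_embeddings(logits, embedding_matrix):
    """Assumed relaxation: replace each discrete token by the expected
    embedding under its per-token distribution (probabilities @ embeddings)."""
    return softmax(logits) @ embedding_matrix


def sample_candidates(logits, vocab, num_samples, rng):
    """Draw several candidate sentences by sampling each position
    independently from its categorical distribution."""
    probs = softmax(logits)
    candidates = []
    for _ in range(num_samples):
        ids = [rng.choice(len(vocab), p=row) for row in probs]
        candidates.append([vocab[i] for i in ids])
    return candidates


# Toy setup (hypothetical): 3 token positions, 6-word vocabulary, 4-dim embeddings.
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))

# Placeholder for parameters that an attack would obtain by optimization.
logits = rng.normal(size=(3, len(vocab)))

soft_inputs = relaxed_embeddings(logits, embedding_matrix)  # shape (3, 4)
candidates = sample_candidates(logits, vocab, num_samples=4, rng=rng)
for sentence in candidates:
    print(" ".join(sentence))
```

In a full attack, each sampled candidate would still have to be checked against the target model and a similarity criterion; that filtering is omitted here because the abstract does not describe it.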