Language models today achieve high accuracy across a large number of downstream tasks. However, they remain susceptible to adversarial attacks, particularly those in which the adversarial examples remain highly similar to the original text. Given the multilingual nature of text, two questions remain largely unexplored: how effective adversarial examples are across translations, and how machine translation can improve their robustness. In this paper, we present a comprehensive study of the robustness of current text adversarial attacks to round-trip translation. We demonstrate that six state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. Furthermore, we introduce an intervention-based solution to this problem: we integrate machine translation into the process of adversarial example generation and demonstrate increased robustness to round-trip translation. Our results indicate that finding adversarial examples robust to translation can help identify weaknesses of language models that are common across languages, and they motivate further research into multilingual adversarial attacks.
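The round-trip translation check described above can be illustrated with a minimal sketch. It is not the paper's implementation. The names `round_trip`, `survival_rate`, `toy_translate`, and `toy_classify` are hypothetical, and the translator and classifier are toy stand-ins, assumed here only so that the example runs without a real translation model. In practice they would be replaced by an actual machine translation system and the language model under attack.

```python
from typing import Callable, List, Tuple

Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Classify = Callable[[str], str]             # text -> predicted label


def round_trip(text: str, translate: Translate,
               source: str = "en", pivot: str = "fr") -> str:
    """Translate source -> pivot -> source and return the result."""
    return translate(translate(text, source, pivot), pivot, source)


def survival_rate(pairs: List[Tuple[str, str]], classify: Classify,
                  translate: Translate, pivot: str = "fr") -> float:
    """Fraction of successful attacks that still flip the label after round-trip."""
    successful = survived = 0
    for original, adversarial in pairs:
        label = classify(original)
        if classify(adversarial) == label:
            continue  # the attack did not succeed even before translation
        successful += 1
        if classify(round_trip(adversarial, translate, pivot=pivot)) != label:
            survived += 1
    return survived / successful if successful else 0.0


# --- Toy stand-ins (assumptions for illustration only) ---
EN_FR = {"good": "bon", "g00d": "bon", "fine": "bien", "movie": "film"}
FR_EN = {"bon": "good", "bien": "fine", "film": "movie"}


def toy_translate(text: str, source: str, target: str) -> str:
    """Word-by-word dictionary lookup; maps the typo 'g00d' to the same word as 'good'."""
    table = EN_FR if target == "fr" else FR_EN
    return " ".join(table.get(word, word) for word in text.split())


def toy_classify(text: str) -> str:
    """Keyword classifier: 'positive' only if the exact token 'good' appears."""
    return "positive" if "good" in text.split() else "negative"


PAIRS = [
    ("good movie", "g00d movie"),  # character-level attack, undone by translation
    ("good movie", "fine movie"),  # word substitution, preserved by translation
]

if __name__ == "__main__":
    print(survival_rate(PAIRS, toy_classify, toy_translate))
```

In this toy setup the character-level perturbation is normalized away by the round trip while the word substitution is preserved, which mirrors the kind of effect the study measures. Real attacks and translators will behave differently.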