Language models (LMs) are indispensable tools for natural language processing tasks, but their vulnerability to adversarial attacks remains a concern. While current research has explored adversarial training techniques, these techniques have yielded only limited gains in defending against word-level attacks. In this work, we propose Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy that enhances the robustness of LMs. Drawing inspiration from recent studies in the image domain, we investigate and confirm that in a discrete data setting such as language, adversarial samples generated via word substitutions do belong to an adversarial domain that exhibits a high Wasserstein distance from the base domain. Our method learns a robust representation that bridges these two domains. We hypothesize that attack robustness improves if samples are projected not into an adversarial domain but into a domain with minimal shift. We align the two domains by incorporating a new distance-based objective. By aligning the model's high-level output features, the model learns more generalized representations and therefore handles unseen adversarial samples better. The method generalizes across word embeddings, even when they share minimal overlap at both the vocabulary and word-substitution levels. To evaluate the effectiveness of our approach, we conduct experiments with BERT and RoBERTa models on three datasets. The results demonstrate promising state-of-the-art robustness.
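To make the idea of a distance-based alignment objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes that base and adversarial samples have already been encoded into fixed-size feature vectors, and it uses a sliced Wasserstein approximation as a stand-in for whichever distance the method actually employs. The names `sliced_wasserstein`, `combined_objective`, `n_projections`, and `weight` are hypothetical and introduced only for illustration.

```python
import numpy as np


def sliced_wasserstein(base_feats, adv_feats, n_projections=64, seed=0):
    """Approximate the Wasserstein-1 distance between two equal-sized feature sets.

    Assumption: both inputs are arrays of shape (n_samples, dim) with the same
    n_samples. This is an illustrative stand-in, not the paper's exact distance.
    """
    rng = np.random.default_rng(seed)
    dim = base_feats.shape[1]
    # Draw random unit directions onto which the features are projected.
    directions = rng.normal(size=(n_projections, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # In one dimension, optimal transport between equal-sized samples pairs
    # the sorted values, so sort each projection independently.
    base_proj = np.sort(base_feats @ directions.T, axis=0)
    adv_proj = np.sort(adv_feats @ directions.T, axis=0)
    # Average the per-projection transport cost.
    return float(np.mean(np.abs(base_proj - adv_proj)))


def combined_objective(task_loss, base_feats, adv_feats, weight=1.0):
    """Hypothetical training objective: task loss plus a weighted alignment term."""
    return task_loss + weight * sliced_wasserstein(base_feats, adv_feats)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    base = rng.normal(size=(128, 16))   # stand-in for base-domain features
    near = base + 0.01                  # small domain shift
    far = base + 3.0                    # large domain shift
    print("small shift:", sliced_wasserstein(base, near))
    print("large shift:", sliced_wasserstein(base, far))
```

In this sketch, a larger shift between the two feature sets produces a larger alignment term, so minimizing the combined objective pushes adversarial features toward the base-domain features while the task loss is still optimized.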