Adversarial training is one of the best-performing methods for improving the robustness of deep language models. However, this robustness comes at a high time cost, because adversarial samples must be obtained through multi-step gradient ascent or word substitution. In addition, the generated samples are deficient in grammatical quality and semantic consistency, which impairs the effectiveness of adversarial training. To address these problems, we introduce a novel, effective procedure that replaces adversarial training and uses only clean data. Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the probability distribution of the input data rather than their embeddings. This formulation results in a robust model that minimizes the expected global loss under adversarial attacks. Our approach requires zero adversarial samples for training and reduces time consumption by up to 70% compared to current best-performing adversarial training methods. Experiments demonstrate that DSRM considerably improves BERT's resistance to textual adversarial attacks and achieves state-of-the-art robust accuracy on various benchmarks.
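The idea of perturbing the data distribution rather than the embeddings can be illustrated with a small sketch. The code below is not the DSRM estimator itself, whose details the abstract does not give. It assumes a generic KL-regularized worst-case reweighting of per-sample clean losses, in which the shifted distribution puts weight proportional to exp(loss / temperature) on each sample. The names `worst_case_weights`, `shifted_risk`, and `temperature` are hypothetical and chosen only for this illustration.

```python
import math


def worst_case_weights(losses, temperature=1.0):
    """Reweight clean samples toward the ones with the highest loss.

    Assumption: the distribution shift is penalized by a KL term, so the
    worst-case weights are a softmax of loss / temperature. This is a
    generic distributionally robust reweighting, not the paper's estimator.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    largest = max(losses)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((loss - largest) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def shifted_risk(losses, temperature=1.0):
    """Expected loss under the shifted (reweighted) distribution."""
    weights = worst_case_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical per-sample losses measured on clean data only.
clean_losses = [0.1, 0.4, 2.0]
uniform_risk = sum(clean_losses) / len(clean_losses)
robust_risk = shifted_risk(clean_losses, temperature=1.0)
```

Under this assumption, no adversarial sample is ever generated: the only inputs are losses on clean data, and the shifted risk is never smaller than the uniform average. A larger `temperature` corresponds to a smaller allowed shift and moves the result back toward ordinary empirical risk.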