Large-scale language models have achieved state-of-the-art performance on a number of language tasks. However, they fail on adversarial language examples: sentences optimized to fool the model while keeping a similar meaning for human readers. Prior work focuses on making language models robust at training time, but retraining for robustness is often unrealistic for large-scale foundation models. Instead, we propose to make language models robust at test time. By dynamically adapting the input sentence using predictions for masked words, we show that we can reverse many adversarial language attacks. Since our approach does not require any training, it works for novel tasks at test time and can adapt to novel adversarial corruptions. Visualizations and empirical results on two popular sentence classification datasets demonstrate that our method can repair over 65% of adversarial language attacks.
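The idea of adapting an input sentence with masked-word predictions can be illustrated with a minimal sketch. This is not the authors' implementation: the function `repair_sentence`, the `top_k` threshold, the replace-with-top-candidate rule, and the lookup-table `toy_predictor` (a stand-in for a real masked language model) are all assumptions made for illustration. Each word is masked in turn, and a word is replaced only when the predictor does not rank it among its top candidates for that position.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"

# A predictor takes tokens containing exactly one MASK and returns
# candidate words for that position, ranked from most to least likely.
Predictor = Callable[[Sequence[str]], List[str]]


def repair_sentence(tokens: Sequence[str],
                    predict_masked: Predictor,
                    top_k: int = 3) -> List[str]:
    """Replace words that the masked-word predictor finds implausible.

    Hypothetical repair rule (an assumption, not taken from the paper):
    a word is kept if it appears among the top_k predictions for its
    position; otherwise it is swapped for the top prediction.
    """
    repaired = list(tokens)
    for i in range(len(repaired)):
        # Mask position i, keeping any earlier repairs as context.
        masked = repaired[:i] + [MASK] + repaired[i + 1:]
        candidates = predict_masked(masked)[:top_k]
        # No candidates means the predictor has no opinion: leave the word.
        if candidates and repaired[i] not in candidates:
            repaired[i] = candidates[0]
    return repaired


def toy_predictor(masked_tokens: Sequence[str]) -> List[str]:
    """Toy stand-in for a masked language model (assumption).

    It looks only at the word to the left of the mask; a real model
    would condition on the whole sentence.
    """
    tokens = list(masked_tokens)
    i = tokens.index(MASK)
    left = tokens[i - 1] if i > 0 else "<s>"
    table = {
        "<s>": ["the"],
        "the": ["film", "movie"],
        "film": ["was", "is"],
        "was": ["terrific", "great", "good"],
    }
    return table.get(left, [])


attacked = ["the", "film", "was", "terrifik"]  # character-level perturbation
print(repair_sentence(attacked, toy_predictor))
```

Because the repair step only queries a predictor and never updates model weights, the same loop could in principle be placed in front of any downstream classifier; in practice `toy_predictor` would be swapped for a pretrained masked language model.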