Recent work has shown that state-of-the-art NLP models are vulnerable to adversarial attacks, in which a model's predictions can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed and adapted to the discrete nature of textual adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing have not been studied for language models. In this paper, we study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks, in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
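For readers unfamiliar with label smoothing, the following is a minimal sketch of the standard uniform variant only, not necessarily any of the specific strategies evaluated in the paper. The function names, the smoothing weight `alpha = 0.1`, and the toy logits are illustrative assumptions. The sketch shows why smoothing discourages over-confidence: with a smoothed target, the cross-entropy loss is higher for an extremely confident prediction than for a moderately confident one, whereas a hard one-hot target rewards the extreme prediction.

```python
import math


def smooth_labels(true_class, num_classes, alpha=0.1):
    """Uniform label smoothing: mix the one-hot target with a uniform distribution.

    alpha is the smoothing weight (an illustrative choice, not a value from the paper).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    uniform = alpha / num_classes
    return [(1.0 - alpha) + uniform if k == true_class else uniform
            for k in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Toy 3-class example (hypothetical logits): class 0 is the gold label.
hard_target = smooth_labels(true_class=0, num_classes=3, alpha=0.0)
soft_target = smooth_labels(true_class=0, num_classes=3, alpha=0.1)

confident_logits = [10.0, 0.0, 0.0]  # near-certain prediction
moderate_logits = [3.0, 0.0, 0.0]    # correct but less extreme prediction

hard_confident = cross_entropy(hard_target, confident_logits)
hard_moderate = cross_entropy(hard_target, moderate_logits)
soft_confident = cross_entropy(soft_target, confident_logits)
soft_moderate = cross_entropy(soft_target, moderate_logits)
```

Under the hard target, `hard_confident` is smaller than `hard_moderate`, so training pushes toward ever more extreme logits. Under the smoothed target the ordering flips, which is the mechanism usually credited for better-calibrated, less over-confident predictions.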