Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that adversarially trained models face a trade-off between robustness and generalization. The goal of this work is to provide an in-depth comparison of different approaches to adversarial training in language models. Specifically, we study the effect of pre-training data augmentation, as well as training-time input-space perturbations versus embedding-space perturbations, on the robustness and generalization of transformer-based language models. Our findings suggest that better robustness can be achieved by pre-training data augmentation or by training with input-space perturbations. However, training with embedding-space perturbations significantly improves generalization. A linguistic correlation analysis of neurons in the learned models reveals that the improved generalization is due to 'more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples in adversarial training of language models.