Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark, named GLUE-X, for evaluating OOD robustness in NLP models. It highlights the importance of OOD robustness and provides insights on how to measure and improve the robustness of a model. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks: significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.