Pre-trained language models (PLMs) are known to improve the generalization of natural language understanding models by leveraging large amounts of data during pre-training. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at a unified benchmark, named \method, for evaluating OOD robustness in NLP models; it highlights the importance of OOD robustness and provides insights into how to measure and improve a model's robustness. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks: significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.