Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

This paper reexamines research on out-of-distribution (OOD) robustness in NLP. We find that the distribution shift settings in previous studies are often insufficiently challenging, which hinders accurate evaluation of OOD robustness. To address this issue, we propose a benchmark construction protocol that ensures clearly differentiated and challenging distribution shifts. We then introduce BOSS, a Benchmark suite for Out-of-distribution robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we conduct a series of experiments on pre-trained language models to analyze and evaluate OOD robustness. First, for vanilla fine-tuning, we examine the relationship between in-distribution (ID) and OOD performance. We identify three typical types of this relationship that unveil the inner learning mechanism and could help forecast OOD robustness from advancements on ID datasets. Second, we evaluate 5 classic methods on BOSS and find that, although they are somewhat effective in specific cases, they do not offer significant improvement over vanilla fine-tuning. Third, we evaluate 5 LLMs with various adaptation paradigms and find that, when sufficient ID data is available, fine-tuned domain-specific models significantly outperform LLMs on ID examples. On OOD instances, however, prioritizing LLMs with in-context learning yields better results. We also find that both fine-tuned small models and LLMs still struggle to address downstream tasks effectively. The code is publicly available at \url{https://github.com/lifan-yuan/OOD_NLP}.
