Existing research on Domain Robustness (DR) suffers from disparate setups, limited variety of evaluation tasks, and reliance on challenge sets. In this paper, we pose a fundamental question: What is the state of affairs of the DR challenge in the era of Large Language Models (LLMs)? To this end, we construct a DR benchmark comprising diverse NLP tasks, including sentence- and token-level classification, QA, and generation, with each task consisting of several domains. We explore the DR challenge of fine-tuned and few-shot learning models in natural domain shift settings and devise two diagnostic metrics of Out-of-Distribution (OOD) performance degradation: the commonly used Source Drop (SD) and the overlooked Target Drop (TD). Our findings reveal important insights: first, despite their capabilities, zero-to-few-shot LLMs and fine-tuning approaches still fail to meet satisfactory performance in the OOD context; second, TD approximates the average OOD degradation better than SD; third, in a significant proportion of domain shifts, either SD or TD is positive, but not both, and therefore disregarding one can lead to incorrect DR conclusions.
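The abstract names the two diagnostic metrics but does not spell out their formulas. The sketch below assumes one common reading: SD compares cross-domain performance against in-domain performance on the source domain, and TD compares it against in-domain performance on the target domain. The function names, the `toy_scores` table, and its numbers are hypothetical and purely illustrative; they are not results from the paper.

```python
from typing import Dict, List, Tuple

# (train_domain, test_domain) -> task score (e.g. accuracy or F1).
Scores = Dict[Tuple[str, str], float]


def source_drop(scores: Scores, source: str, target: str) -> float:
    """Assumed SD: in-domain score on the source minus the cross-domain score."""
    return scores[(source, source)] - scores[(source, target)]


def target_drop(scores: Scores, source: str, target: str) -> float:
    """Assumed TD: in-domain score on the target minus the cross-domain score."""
    return scores[(target, target)] - scores[(source, target)]


def disagreeing_shifts(scores: Scores) -> List[Tuple[str, str]]:
    """Return the shifts where exactly one of SD and TD is positive."""
    domains = sorted({domain for pair in scores for domain in pair})
    result = []
    for source in domains:
        for target in domains:
            if source == target:
                continue
            sd = source_drop(scores, source, target)
            td = target_drop(scores, source, target)
            if (sd > 0) != (td > 0):
                result.append((source, target))
    return result


# Made-up scores for two hypothetical domains, A and B.
toy_scores: Scores = {
    ("A", "A"): 90.0,
    ("A", "B"): 80.0,
    ("B", "B"): 75.0,
    ("B", "A"): 85.0,
}

for src, tgt in disagreeing_shifts(toy_scores):
    sd = source_drop(toy_scores, src, tgt)
    td = target_drop(toy_scores, src, tgt)
    print(f"{src} -> {tgt}: SD = {sd:+.1f}, TD = {td:+.1f}")
```

In this toy table, the shift from A to B has a positive SD and a negative TD: the model loses ground relative to its source-domain score, yet it still outperforms a model trained on B itself. Reporting SD alone would suggest a robustness failure here, which is the kind of incorrect conclusion the third finding warns about.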