Existing research on Domain Robustness (DR) suffers from disparate setups, a lack of task variety, and scarce study of recent capabilities such as few-shot learning. Furthermore, we claim that the common practice of measuring DR may further obscure the picture. Current research focuses on challenge sets and relies solely on the Source Drop (SD): using the source in-domain performance as a reference point for degradation. However, the Target Drop (TD), which measures degradation relative to the target in-domain performance, should be used as a complementary point of view. In this study, we develop a benchmark comprising seven NLP tasks, including classification, QA, and generation. Our benchmark focuses on natural topical domain shifts and enables measuring both the SD and the TD. Our comprehensive study, involving over 14,000 domain shifts across 18 fine-tuned and few-shot models, shows that both model types suffer performance drops upon domain shift. While fine-tuned models excel in-domain, few-shot LLMs often surpass them cross-domain, showing better robustness. In addition, we find that a large SD can often be explained by a shift to a harder domain rather than by a genuine DR challenge. Thus, the TD is a more reliable metric for assessing DR.
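For concreteness, a minimal formal sketch of the two drops, assuming both are normalized by the corresponding in-domain score (the abstract itself does not fix the normalization, so this is an illustrative assumption): let $P(s \to t)$ denote the performance of a model adapted on source domain $s$ and evaluated on target domain $t$. Then
\[
\mathrm{SD}(s,t) = \frac{P(s \to s) - P(s \to t)}{P(s \to s)}, \qquad
\mathrm{TD}(s,t) = \frac{P(t \to t) - P(s \to t)}{P(t \to t)}.
\]
Under this reading, a large SD accompanied by a small TD suggests the target domain is intrinsically harder (even in-domain models score low on it), whereas a large TD signals a genuine robustness gap.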