Existing research on Domain Robustness (DR) suffers from disparate setups, limited task variety, and little attention to recent models and capabilities such as few-shot learning. Furthermore, we claim that the common practice of measuring DR may further obscure the picture. Current research focuses on challenge sets and relies solely on the Source Drop (SD), which uses the source in-domain performance as the reference point for degradation. However, the Target Drop (TD), which uses the target in-domain performance as the reference point, should be used as a complementary point of view. To understand the DR challenge in modern NLP models, we developed a benchmark comprising seven NLP tasks, including classification, QA, and generation. Our benchmark focuses on natural topical domain shifts and enables measuring both the SD and the TD. Our comprehensive study, involving over 14,000 domain shifts across 18 fine-tuned and few-shot models, shows that both types of models suffer from performance drops upon domain shifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass them cross-domain, showing better robustness. In addition, we found that a large SD can often be explained by a shift to a harder domain rather than by a genuine DR challenge. Thus, the TD is a more reliable metric.
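To make the two metrics concrete, the following is a minimal sketch of how the SD and the TD could be computed for a single domain shift. It assumes that each drop is a plain difference between a reference in-domain score and the cross-domain score, and that higher scores are better. The function names and the example scores are hypothetical and are not taken from the benchmark.

```python
def source_drop(source_in_domain: float, cross_domain: float) -> float:
    """SD: degradation relative to the source in-domain score.

    source_in_domain: score of a model trained on the source domain and
                      tested on the source domain.
    cross_domain:     score of the same model tested on the target domain.
    """
    return source_in_domain - cross_domain


def target_drop(target_in_domain: float, cross_domain: float) -> float:
    """TD: degradation relative to the target in-domain score.

    target_in_domain: score of a model trained on the target domain and
                      tested on the target domain.
    cross_domain:     score of the source-trained model on the target domain.
    """
    return target_in_domain - cross_domain


# Hypothetical scores (illustration only, not results from the study).
source_in_domain = 90.0  # train on source, test on source
target_in_domain = 75.0  # train on target, test on target
cross_domain = 72.0      # train on source, test on target

sd = source_drop(source_in_domain, cross_domain)
td = target_drop(target_in_domain, cross_domain)
print(f"SD = {sd:.1f}, TD = {td:.1f}")
```

With these made-up numbers the SD is large while the TD is small: most of the apparent degradation comes from the target domain being harder, since even a model trained on the target domain scores much lower there than the source model does on its own domain.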