Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet little is understood about the variation in the mechanisms that cause subpopulation shifts, or about how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we report several observations relevant to future progress in this area. First, existing algorithms improve subgroup robustness only over certain types of shifts but not others. Second, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that aim solely to improve worst-group accuracy (WGA), we demonstrate a fundamental tradeoff between WGA and other important metrics, highlighting the need to choose testing metrics carefully. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
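To make the two metrics named in the abstract concrete, the following is a minimal sketch of worst-group accuracy and of model selection by worst-class accuracy. It is an illustration under stated assumptions, not the SubpopBench implementation: the function names (`worst_class_accuracy`, `worst_group_accuracy`, `select_by_worst_class`) and the dictionary of candidate predictions are hypothetical, and accuracy is computed from hard label predictions.

```python
from collections import defaultdict


def _worst_bucket_accuracy(y_true, y_pred, keys):
    """Return the lowest accuracy over buckets defined by `keys` (hypothetical helper)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, k in zip(y_true, y_pred, keys):
        totals[k] += 1
        hits[k] += int(t == p)
    return min(hits[k] / totals[k] for k in totals)


def worst_class_accuracy(y_true, y_pred):
    """Lowest per-class accuracy; buckets are the true labels, so no group annotations are needed."""
    return _worst_bucket_accuracy(y_true, y_pred, y_true)


def worst_group_accuracy(y_true, y_pred, groups):
    """Lowest per-group accuracy (WGA); buckets are externally supplied group annotations."""
    return _worst_bucket_accuracy(y_true, y_pred, groups)


def select_by_worst_class(y_val, candidates):
    """Pick the candidate whose validation predictions maximize worst-class accuracy.

    `candidates` maps a model or checkpoint name to its validation predictions
    (an assumed layout chosen for this sketch).
    """
    return max(candidates, key=lambda name: worst_class_accuracy(y_val, candidates[name]))


if __name__ == "__main__":
    y_val = [0, 0, 0, 0, 1, 1]
    candidates = {
        "ckpt_a": [0, 0, 0, 0, 0, 0],  # ignores the minority class entirely
        "ckpt_b": [0, 0, 0, 1, 1, 1],  # one majority-class error, minority class correct
    }
    print(select_by_worst_class(y_val, candidates))
```

The selection step uses only class labels on the validation set, which is what distinguishes it from criteria that require group-annotated validation data.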