Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet little is understood about how the mechanisms that cause subpopulation shift vary, or about how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in the vision, language, and healthcare domains. With results obtained from training over 10,000 models, we report several observations that bear on future progress in this area. First, existing algorithms improve subgroup robustness only under certain types of shift and not under others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that aim solely to improve worst-group accuracy (WGA), we demonstrate a fundamental tradeoff between WGA and other important metrics, highlighting the need to choose evaluation metrics carefully. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
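To make the two quantities named in the abstract concrete, the following is a minimal sketch of worst-group accuracy (WGA), worst-class accuracy, and model selection by worst-class accuracy. It is an illustration under stated assumptions, not the SubpopBench implementation: a group is assumed here to be a (class label, attribute) pair, and all function names (`per_key_accuracy`, `worst_group_accuracy`, `worst_class_accuracy`, `select_checkpoint`) are hypothetical.

```python
from collections import defaultdict


def per_key_accuracy(y_true, y_pred, keys):
    """Accuracy computed separately for each distinct value in `keys`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, k in zip(y_true, y_pred, keys):
        total[k] += 1
        correct[k] += int(t == p)
    return {k: correct[k] / total[k] for k in total}


def worst_group_accuracy(y_true, y_pred, attrs):
    """Minimum accuracy over groups; needs attribute annotations.

    Assumption: a group is a (class label, attribute) pair.
    """
    groups = list(zip(y_true, attrs))
    return min(per_key_accuracy(y_true, y_pred, groups).values())


def worst_class_accuracy(y_true, y_pred):
    """Minimum per-class accuracy; needs only class labels."""
    return min(per_key_accuracy(y_true, y_pred, y_true).values())


def select_checkpoint(y_val, preds_by_ckpt):
    """Pick the checkpoint name whose validation predictions give the
    highest worst-class accuracy (no group information is used)."""
    return max(
        preds_by_ckpt,
        key=lambda name: worst_class_accuracy(y_val, preds_by_ckpt[name]),
    )


if __name__ == "__main__":
    # Toy validation set: 8 examples, binary label, binary attribute.
    y_val = [0, 0, 0, 0, 1, 1, 1, 1]
    attrs = [0, 0, 0, 1, 0, 1, 1, 1]
    preds = {
        "a": [0, 0, 0, 0, 1, 0, 0, 0],
        "b": [0, 0, 1, 0, 1, 1, 1, 0],
    }
    best = select_checkpoint(y_val, preds)
    print("selected:", best)
    print("its WGA:", worst_group_accuracy(y_val, preds[best], attrs))
```

The selection function never reads `attrs`, which is the point of the criterion described above: group annotations are needed only to report WGA afterwards, not to choose the model.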