On the nonlinear correlation of ML performance between data subpopulations

Understanding how machine learning (ML) models perform across diverse data distributions is critical for reliable applications. Although recent empirical studies posit a near-perfect linear correlation between in-distribution (ID) and out-of-distribution (OOD) accuracies, we show empirically that this correlation is more nuanced under subpopulation shifts. Through rigorous experimentation and analysis across a variety of datasets, models, and training epochs, we demonstrate that OOD performance often correlates nonlinearly with ID performance under subpopulation shifts. In contrast to the linear correlation reported in previous studies, our findings reveal a "moon shape" correlation (a parabolic uptrend curve) between the test performance on the majority subpopulation and that on the minority subpopulation. This non-trivial nonlinear correlation holds across model architectures, hyperparameters, training durations, and degrees of imbalance between subpopulations. Furthermore, we find that the nonlinearity of this "moon shape" is causally influenced by the degree of spurious correlation in the training data: in our controlled experiments, stronger spurious correlations produce a more nonlinear performance correlation. We provide complementary experimental and theoretical analyses of this phenomenon and discuss its implications for ML reliability and fairness. Our work highlights the importance of understanding the nonlinear effects of model improvement on performance in different subpopulations, and it has the potential to inform the development of more equitable and responsible ML models.

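One way to check for the "moon shape" described above is to fit both a line and a parabola to (majority accuracy, minority accuracy) pairs collected from many model checkpoints, then compare the fits. The sketch below is illustrative only and is not the authors' analysis code: the function names `r_squared` and `compare_fits` are hypothetical, and the accuracy values are synthetic, generated from an assumed convex curve plus noise rather than taken from any experiment.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat against targets y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_fits(majority_acc, minority_acc):
    """Fit degree-1 and degree-2 polynomials and report their quality.

    Hypothetical helper: a positive 'curvature' together with a clearly
    higher quadratic R^2 suggests a convex, upward-bending relationship.
    """
    x = np.asarray(majority_acc, dtype=float)
    y = np.asarray(minority_acc, dtype=float)
    linear = np.polyfit(x, y, deg=1)     # coefficients [slope, intercept]
    quadratic = np.polyfit(x, y, deg=2)  # coefficients [a, b, c], highest degree first
    return {
        "r2_linear": r_squared(y, np.polyval(linear, x)),
        "r2_quadratic": r_squared(y, np.polyval(quadratic, x)),
        "curvature": quadratic[0],
    }


# Synthetic data (an assumption for illustration, not measured results):
# minority accuracy rises slowly at first, then faster, as majority accuracy grows.
rng = np.random.default_rng(0)
majority = np.linspace(0.5, 0.95, 40)
minority = 0.2 + 2.0 * (majority - 0.5) ** 2 + rng.normal(0.0, 0.01, size=40)

result = compare_fits(majority, minority)
print(result)
```

Because the linear model is nested inside the quadratic one, the quadratic R² can never be lower, so the size of the gap and the sign of the curvature term are the informative quantities. On real checkpoints, a held-out comparison or an information criterion would be a more careful test than in-sample R².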