Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations

Understanding how machine learning (ML) models perform across diverse data distributions is critical for reliable applications. Recent empirical studies have posited a near-perfect linear correlation between in-distribution (ID) and out-of-distribution (OOD) accuracies; we show empirically that this correlation is more nuanced under subpopulation shifts. Through experiments across a variety of datasets, models, and training epochs, we demonstrate that OOD performance often correlates nonlinearly with ID performance under subpopulation shifts. In contrast to the linear correlation reported in those studies, our findings reveal a "moon shape" correlation (a parabolic uptrend curve) between test performance on the majority subpopulation and on the minority subpopulation. This nonlinear correlation holds across model architectures, hyperparameters, training durations, and degrees of imbalance between subpopulations. Furthermore, we find that the nonlinearity of this "moon shape" is causally influenced by the degree of spurious correlation in the training data: in our controlled experiments, stronger spurious correlation yields a more nonlinear performance correlation. We provide complementary experimental and theoretical analyses of this phenomenon and discuss its implications for ML reliability and fairness. Our work highlights the importance of understanding the nonlinear effects of model improvement on performance across subpopulations, and it has the potential to inform the development of more equitable and responsible machine learning models.
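
As a rough illustration of what a nonlinear majority/minority correlation means in practice, the sketch below fits a line and a parabola to pairs of (majority accuracy, minority accuracy) and compares their R² values. It is not the analysis procedure used in this work: the data points are synthetic and generated from an assumed convex curve, and the helper name `fit_r2` and all numeric constants are hypothetical choices made only for this example.

```python
import numpy as np


def fit_r2(x, y, degree):
    """Least-squares polynomial fit of the given degree; returns (coeffs, R^2)."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return coeffs, 1.0 - ss_res / ss_tot


# Synthetic, illustrative data only (NOT results from the paper).
# Each point stands for one hypothetical model checkpoint.
rng = np.random.default_rng(0)
majority_acc = np.linspace(0.50, 0.99, 50)
# Assumed convex uptrend plus small noise, to mimic a curved relationship.
minority_acc = 0.10 + 1.6 * (majority_acc - 0.50) ** 2 + rng.normal(0.0, 0.005, 50)

coeffs_lin, r2_lin = fit_r2(majority_acc, minority_acc, degree=1)
coeffs_quad, r2_quad = fit_r2(majority_acc, minority_acc, degree=2)

print(f"linear    R^2 = {r2_lin:.4f}")
print(f"quadratic R^2 = {r2_quad:.4f}")
print(f"quadratic leading coefficient = {coeffs_quad[0]:.3f}")
```

A clearly higher R² for the quadratic fit, together with a leading coefficient well away from zero, is one simple sign that a single straight line does not describe the relationship. On real checkpoints, the two arrays would be replaced by measured per-subpopulation test accuracies.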
