Group imbalance is a known problem in empirical risk minimization (ERM), where a high average accuracy is accompanied by low accuracy on a minority group. Despite algorithmic efforts to improve minority-group accuracy, a theoretical generalization analysis of ERM on individual groups remains elusive. By formulating the group imbalance problem with a Gaussian mixture model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights from our theoretical results include that, when all group-level covariances are in the medium regime and all means are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets, such as CelebA and CIFAR-10 in image classification.
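To make the setting concrete, the following is a minimal sketch, not the paper's experimental code. It draws inputs from a two-component Gaussian mixture in which each component plays the role of one group, trains a one-hidden-layer network by gradient descent on the empirical logistic risk, and reports the average and group-level test accuracy. Everything specific here is an assumption made for illustration: the function names, the isotropic covariances, the linear labelling rule `w_true`, the tanh activation, and all hyperparameters.

```python
import numpy as np

def sample_groups(n, fractions, means, stds, w_true, rng):
    """Draw n points from a Gaussian mixture; each component is one group."""
    fractions = np.asarray(fractions, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    groups = rng.choice(len(fractions), size=n, p=fractions)
    noise = rng.standard_normal((n, means.shape[1]))
    # Isotropic covariance std^2 * I per group (an assumption of this sketch).
    x = means[groups] + stds[groups, None] * noise
    # Hypothetical labelling rule: sign of a fixed linear score.
    y = (x @ w_true > 0).astype(float)
    return x, y, groups

def train_one_hidden_layer(x, y, hidden=8, lr=0.5, steps=500, rng=None):
    """Gradient descent on the mean logistic loss (ERM over all samples)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = x.shape
    W = 0.5 * rng.standard_normal((d, hidden))
    b = np.zeros(hidden)
    v = 0.5 * rng.standard_normal(hidden)
    c = 0.0
    for _ in range(steps):
        h = np.tanh(x @ W + b)                    # hidden activations, (n, hidden)
        logits = np.clip(h @ v + c, -30.0, 30.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-logits))
        g = (p - y) / n                           # d(mean loss)/d(logits)
        dh = np.outer(g, v) * (1.0 - h ** 2)      # backprop through tanh
        W -= lr * (x.T @ dh)
        b -= lr * dh.sum(axis=0)
        v -= lr * (h.T @ g)
        c -= lr * g.sum()
    return W, b, v, c

def predict(params, x):
    W, b, v, c = params
    return (np.tanh(x @ W + b) @ v + c > 0).astype(float)

def group_accuracies(params, x, y, groups, n_groups):
    """Return (average accuracy, array of per-group accuracies)."""
    correct = predict(params, x) == y
    per_group = np.array([correct[groups == k].mean() for k in range(n_groups)])
    return correct.mean(), per_group

rng = np.random.default_rng(0)
d = 5
w_true = rng.standard_normal(d)
fractions = [0.9, 0.1]                  # majority group 0, minority group 1
means = [np.zeros(d), 2.0 * np.ones(d)]
stds = [1.0, 1.0]

x_tr, y_tr, g_tr = sample_groups(2000, fractions, means, stds, w_true, rng)
x_te, y_te, g_te = sample_groups(2000, fractions, means, stds, w_true, rng)
params = train_one_hidden_layer(x_tr, y_tr, rng=rng)
avg_acc, per_group = group_accuracies(params, x_te, y_te, g_te, n_groups=2)
print("average accuracy:", avg_acc)
print("group-level accuracy:", per_group)
```

Varying `fractions`, `means`, and `stds` in this toy setup is one way to probe how the minority fraction and the group-level means and covariances affect the group-level accuracy; it does not reproduce the paper's theoretical bounds.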