When facing data with imbalanced classes or groups, practitioners follow an intriguing strategy to achieve the best results: they throw away examples until the classes or groups are balanced in size, and then perform empirical risk minimization on the reduced training set. This runs counter to common wisdom in learning theory, where the expected error is supposed to decrease as the dataset grows in size. In this work, we leverage extreme value theory to address this apparent contradiction. Our results show that the tails of the data distribution play an important role in determining the worst-group accuracy of linear classifiers. When learning on data with heavy tails, throwing away data restores the geometric symmetry of the resulting classifier, and therefore improves its worst-group generalization.
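The balancing step described in the abstract can be illustrated with a minimal sketch. The function name `subsample_to_balance` and the choice to discard examples uniformly at random are assumptions made here for illustration; the abstract does not specify how the discarded examples are chosen.

```python
import numpy as np


def subsample_to_balance(X, groups, seed=0):
    """Keep a random subset so that every group matches the smallest group's size.

    Hypothetical helper: uniform random selection within each group is an
    assumption, not a detail taken from the abstract.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(groups, return_counts=True)
    n_min = counts.min()  # size of the smallest group

    # For each group, pick n_min indices without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=n_min, replace=False)
        for g in labels
    ])
    keep.sort()  # preserve the original ordering of the retained examples
    return X[keep], groups[keep]
```

Empirical risk minimization would then be run on the returned arrays with whatever learner is in use, for example a linear classifier.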