Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data depending on class abundance, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we derive exact analytical expressions for the generalization curves of linear classifiers (Support Vector Machines) in the high-dimensional regime. We also provide a sharp prediction of the effects of under/oversampling strategies as a function of the class imbalance, the first and second moments of the data, and the performance metric considered. We show that mixed strategies combining under- and oversampling of the data improve performance. Through numerical experiments, we demonstrate that our theoretical predictions remain relevant on real datasets, for deeper architectures, and for sampling strategies based on unsupervised probabilistic models.
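To make the sampling strategies discussed above concrete, the following is a minimal sketch (not the paper's actual protocol) of random under- and oversampling on a toy imbalanced Gaussian dataset; the function names, the 900/100 split, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def undersample(X, y, rng):
    """Randomly drop majority-class points until all classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Randomly duplicate minority-class points until all classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# Toy imbalanced mixture: 900 majority vs 100 minority points in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (900, 10)),
               rng.normal(0.5, 1.0, (100, 10))])
y = np.array([0] * 900 + [1] * 100)

Xu, yu = undersample(X, y, rng)  # both classes shrunk to the minority size
Xo, yo = oversample(X, y, rng)   # both classes grown to the majority size
print(np.bincount(yu))  # [100 100]
print(np.bincount(yo))  # [900 900]
```

A mixed strategy, as studied in the paper, would instead pick an intermediate target size per class, undersampling the majority while oversampling the minority.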