Class imbalance and distributional differences in large datasets present significant challenges for machine learning classification tasks, often leading to biased models and poor predictive performance on minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. Both methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
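The abstract does not specify the proposed algorithms in detail, but the traditional baseline it compares against can be illustrated. The sketch below (an assumption, not the paper's method) shows plain random undersampling, which balances classes by discarding majority-class points at random; the proposed approaches instead aim to keep a *representative* subset.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Balance a dataset by randomly undersampling every class down to
    the size of the smallest class.

    Baseline illustration only; the paper's methods instead choose
    representative points (via mutual information / support points).
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, keep n_min randomly chosen indices without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]

# Imbalanced toy data: 90 majority (class 0) vs 10 minority (class 1).
X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = random_undersample(X, y)
print(np.bincount(yb))  # both classes now have 10 samples
```

Random undersampling is cheap but can discard informative majority-class points, which is exactly the information loss the proposed representative-selection methods try to minimize.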