Class imbalance in binary classification tasks remains a significant challenge in machine learning, often resulting in poor performance on minority classes. This study comprehensively evaluates three widely used strategies for handling class imbalance: the Synthetic Minority Over-sampling Technique (SMOTE), Class Weights tuning, and Decision Threshold Calibration. We compare these methods against a no-intervention baseline across 15 diverse machine learning models and 30 datasets from various domains, conducting a total of 9,000 experiments. Performance was primarily assessed using the F1-score, although we also tracked results on nine additional metrics, including F2-score, precision, recall, Brier score, PR-AUC, and AUC. Our results indicate that all three strategies generally outperform the baseline, with Decision Threshold Calibration emerging as the most consistently effective technique. However, we observed substantial variability in the best-performing method across datasets, highlighting the importance of testing multiple approaches for a given problem. This study provides valuable insights for practitioners dealing with imbalanced datasets and emphasizes the need for dataset-specific analysis when evaluating class imbalance handling techniques.
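To make the three interventions concrete, the following is a minimal sketch of how each could be applied and scored with the F1 metric. It assumes scikit-learn and imbalanced-learn; the synthetic dataset, the RandomForestClassifier, and the threshold grid are illustrative choices for this sketch, not the study's actual 15-model, 30-dataset protocol.

```python
# Hypothetical sketch of the three class-imbalance strategies vs. a baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced dataset (roughly 10% minority class).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# Baseline: no intervention.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("baseline F1:", f1_score(y_te, base.predict(X_te)))

# 1) SMOTE: synthetically oversample the minority class before fitting.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
sm = RandomForestClassifier(random_state=0).fit(X_sm, y_sm)
print("SMOTE F1:", f1_score(y_te, sm.predict(X_te)))

# 2) Class weights: weight errors on the minority class more heavily.
cw = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)
print("class-weight F1:", f1_score(y_te, cw.predict(X_te)))

# 3) Decision threshold calibration: keep the model unchanged but replace the
#    default 0.5 cutoff with the threshold that maximizes F1 on a validation split.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, stratify=y_tr, test_size=0.25, random_state=0)
thr_model = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
val_proba = thr_model.predict_proba(X_val)[:, 1]
grid = np.linspace(0.05, 0.95, 19)
best_t = max(grid, key=lambda t: f1_score(y_val, (val_proba >= t).astype(int)))
test_pred = (thr_model.predict_proba(X_te)[:, 1] >= best_t).astype(int)
print("threshold-calibrated F1:", f1_score(y_te, test_pred))
```

The same pattern extends to the other tracked metrics (F2-score, PR-AUC, Brier score, etc.) by swapping the scoring function; which strategy wins is expected to vary by dataset, as the results above emphasize.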