This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach referred to as the baseline. These strategies were tested on 58 real-life binary imbalanced datasets with imbalance ratios (IR) ranging from 3 to 120. We conducted a comparative analysis of 10 undersampling algorithms, 5 oversampling algorithms, 2 ensemble methods, and 3 specialized algorithms across eight performance metrics: accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure, G-mean, Matthews correlation coefficient, precision, and recall. Additionally, we assessed the six strategies on altered datasets, derived from real-life data, with both low (3) and high (100 or 300) IR. The principal finding is that the effectiveness of each strategy varies significantly depending on the metric used. The paper also examines a selection of newer algorithms within the categories of specialized algorithms, oversampling, and ensemble methods. The findings suggest that the current hierarchy of best-performing strategies for each metric is unlikely to change with the introduction of newer algorithms.
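To illustrate why a strategy's ranking can depend on the metric, the following minimal sketch computes the imbalance ratio and seven of the eight metrics (AUC is omitted because it requires scores rather than hard labels) for two hypothetical classifiers on a synthetic label vector. It is not the paper's evaluation code: the helper names `imbalance_ratio` and `binary_metrics`, the definition of IR as majority count divided by minority count, the convention of returning 0 when a denominator is zero, and the toy predictions are all assumptions made for illustration.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count / minority-class count."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def binary_metrics(y_true, y_pred, positive=1):
    """Compute label-based metrics from the confusion matrix.

    Undefined ratios (zero denominator) are returned as 0.0 by convention;
    this is an assumption, not something stated in the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def safe_div(num, den):
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
        "precision": precision,
        "recall": recall,
    }


# Synthetic labels: 100 minority (positive) and 900 majority (negative) samples.
y_true = [1] * 100 + [0] * 900
ir = imbalance_ratio(y_true)

# Hypothetical classifier A: always predicts the majority class.
pred_majority = [0] * 1000
# Hypothetical classifier B: finds 80 of 100 positives but raises 180 false alarms.
pred_balanced = [1] * 80 + [0] * 20 + [1] * 180 + [0] * 720

m_majority = binary_metrics(y_true, pred_majority)
m_balanced = binary_metrics(y_true, pred_balanced)

for name in m_majority:
    print(f"{name:18s} A={m_majority[name]:.3f}  B={m_balanced[name]:.3f}")
```

In this toy setting, classifier A scores higher on accuracy while classifier B scores higher on balanced accuracy, G-mean, and recall, which mirrors the kind of metric-dependent ordering the abstract describes.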