Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with its default parameter) asymptotically amounts to copying the original minority samples. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. We then introduce two new SMOTE-related strategies and compare them with state-of-the-art rebalancing procedures. Surprisingly, for most data sets, we observe that applying no rebalancing strategy is competitive in terms of predictive performance when tuned random forests are used. For highly imbalanced data sets, our new method, named Multivariate Gaussian SMOTE, is competitive. In addition, our analysis sheds some light on the behavior of common rebalancing strategies when used in conjunction with random forests.
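For readers unfamiliar with the mechanism, the following is a minimal sketch of the standard SMOTE interpolation step, not the code used in the paper. The function name `smote_sketch`, its arguments, and the choice `k=5` (a common default for the number of neighbors in SMOTE implementations) are assumptions made for illustration. Each synthetic point is drawn uniformly on the segment between a minority sample and one of its `k` nearest minority neighbors.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, rng=None):
    """Illustrative SMOTE: interpolate between minority samples and their neighbors.

    X_min : array of shape (n, d), the minority-class samples.
    n_new : number of synthetic samples to generate.
    k     : number of nearest neighbors (k=5 is an assumed default).
    rng   : seed or numpy Generator, for reproducibility.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of minority samples")

    # Pairwise squared Euclidean distances between minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(d2, axis=1)[:, :k]

    # Pick a random central sample, then one of its k neighbors.
    centers = rng.integers(0, n, size=n_new)
    picks = neighbors[centers, rng.integers(0, k, size=n_new)]

    # Uniform interpolation weight in [0, 1) along the connecting segment.
    w = rng.random((n_new, 1))
    return X_min[centers] + w * (X_min[picks] - X_min[centers])
```

Because every synthetic point is a convex combination of two nearby minority samples, it stays inside the convex hull of the minority class. With `k` held fixed, the neighbors move closer to the central point as the number of minority samples grows, so the synthetic points end up increasingly close to existing samples, which is the intuition behind the asymptotic copying result stated above.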