Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with its default parameter) asymptotically tends to copy the original minority samples. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. We then introduce two new SMOTE-related strategies and compare them with state-of-the-art rebalancing procedures. Surprisingly, for most data sets, we observe that when the classifier is a tuned random forest, logistic regression, or LightGBM, applying no rebalancing strategy is competitive in terms of predictive performance. For highly imbalanced data sets, our new methods, named CV-SMOTE and Multivariate Gaussian SMOTE, are competitive. In addition, our analysis sheds some light on the behavior of common rebalancing strategies when used in conjunction with random forests.
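For readers unfamiliar with the mechanism being analyzed, the following is a minimal sketch of the standard SMOTE interpolation step, not the authors' code. The function name `smote_sketch` and its arguments are hypothetical, and `k=5` is assumed as the default number of neighbors (it matches the `k_neighbors` default of `SMOTE` in imbalanced-learn). Each synthetic point is drawn uniformly on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation.

    Hypothetical helper for illustration only; k=5 is an assumed default.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise squared Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each minority sample.
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    # Pick a central point, then one of its k neighbors, both uniformly.
    centers = rng.integers(0, n, size=n_new)
    picks = nn_idx[centers, rng.integers(0, k, size=n_new)]

    # Interpolate with a uniform weight in [0, 1).
    w = rng.random((n_new, 1))
    return X_min[centers] + w * (X_min[picks] - X_min[centers])


# Usage: 50 minority points in 2D, 200 synthetic points.
X = np.random.default_rng(1).normal(size=(50, 2))
Z = smote_sketch(X, 200)
```

Because every synthetic point is a convex combination of two minority samples, the output never leaves the convex hull of the minority class. With k held fixed while the number of minority samples grows, the selected neighbors get closer to the central point, which gives some intuition for the copying behavior stated above.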