Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced data sets. We prove that, asymptotically, SMOTE (with its default parameter) regenerates the original distribution by simply copying the original minority samples. We also prove that the SMOTE density vanishes near the boundary of the support of the minority distribution, which justifies the common BorderLine SMOTE strategy. We then introduce two new SMOTE-related strategies and compare them with state-of-the-art rebalancing procedures. We show that rebalancing strategies are only required when the data set is highly imbalanced. For such data sets, SMOTE, our proposals, or undersampling procedures are the best strategies.
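For readers unfamiliar with the sampling step that the results above concern, the following is a minimal sketch of SMOTE-style interpolation, not the implementation studied in the paper. The function name `smote_sample` and its parameters `k`, `n_new`, and `seed` are hypothetical names chosen for illustration. The value `k=5` is assumed here to stand for the "default parameter", as in common implementations.

```python
import math
import random


def smote_sample(minority, k=5, n_new=100, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length tuples of floats (minority-class samples).
    k: number of nearest neighbors considered (assumed default: 5).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a central minority point uniformly at random.
        i = rng.randrange(len(minority))
        center = minority[i]
        # Rank the other minority points by Euclidean distance to the center.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(center, p))
        # Choose one of the k nearest neighbors uniformly at random.
        neighbor = rng.choice(others[:k])
        # Interpolate: the new point lies on the segment [center, neighbor].
        w = rng.random()
        synthetic.append(
            tuple(c + w * (nb - c) for c, nb in zip(center, neighbor))
        )
    return synthetic
```

Every synthetic point is a convex combination of two minority samples, so none can fall outside the convex hull of the observed minority points. In addition, when `k` stays fixed while the number of minority samples grows, the selected neighbors lie ever closer to the central point, so the interpolated points end up close to existing samples. These two observations are offered only as informal intuition for the boundary and copying results stated in the abstract, not as a substitute for the proofs.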