Imbalanced classification is a well-known challenge in many real-world applications. It arises when the distribution of the target variable is skewed, which biases predictions toward the majority class. With the arrival of the Big Data era, there is a pressing need for efficient solutions to this problem. In this work, we present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling within a MapReduce framework. Both procedures are performed in the same pass over the data, which makes the technique efficient. The SMOTENN method is complemented by an efficient implementation of the neighborhood computation for the minority samples. Our experimental results show that this approach outperforms alternative resampling techniques on small- and medium-sized datasets and achieves positive results on large datasets with reduced running times.
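To make the idea of oversampling and undersampling in a single pass concrete, the following is a minimal sketch and not the SMOTENN algorithm itself, whose details the abstract does not specify. It assumes plain SMOTE-style interpolation for minority samples and an edited-nearest-neighbor-style rule for majority samples, applied inside one loop over each data partition (the map step) and followed by a simple concatenation (the reduce step). The function names `nearest`, `resample_partition`, and `resample`, the parameter `k`, and the brute-force neighbor search are all illustrative assumptions.

```python
import math
import random


def nearest(point, candidates, k):
    """Return the k rows in candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda row: math.dist(point, row[0]))[:k]


def resample_partition(rows, minority, k=3, seed=0):
    """Hypothetical map step: one loop over a partition of (features, label) rows.

    Minority rows are kept and each spawns one synthetic sample (SMOTE-style).
    Majority rows are dropped when most of their k neighbors are minority
    (an edited-nearest-neighbor-style rule, assumed here for illustration).
    """
    rng = random.Random(seed)
    out = []
    for i, (x, y) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        if y == minority:
            out.append((x, y))
            same_class = [row for row in others if row[1] == minority]
            if same_class:
                # Interpolate between x and one of its minority neighbors.
                neighbor, _ = rng.choice(nearest(x, same_class, k))
                gap = rng.random()
                synthetic = tuple(a + gap * (b - a) for a, b in zip(x, neighbor))
                out.append((synthetic, minority))
        else:
            neighbors = nearest(x, others, k)
            minority_votes = sum(1 for _, label in neighbors if label == minority)
            # Keep the majority row unless minority neighbors outnumber the rest.
            if 2 * minority_votes <= len(neighbors):
                out.append((x, y))
    return out


def resample(partitions, minority, k=3):
    """Hypothetical reduce step: concatenate the resampled partitions."""
    result = []
    for index, partition in enumerate(partitions):
        result.extend(resample_partition(partition, minority, k, seed=index))
    return result
```

Calling `resample([partition_a, partition_b], minority=1)` on lists of `(features, label)` tuples returns a single resampled list. In this sketch neighborhoods are computed only within a partition, which is one simple way to keep the work local to each mapper; an actual distributed implementation may handle neighborhoods differently.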