Class imbalance can lead to biased classifiers that favor the majority class at the expense of the minority class. Yet in many real-life applications, the minority class is precisely the one of practical importance. Hybrid sampling methods address this by oversampling the minority class to increase the number of its instances, then undersampling to remove low-quality instances. However, most existing sampling methods struggle to generate diverse, high-quality instances and often fail to remove noise or low-quality instances effectively at scale. This paper therefore proposes an evolutionary multi-granularity hybrid sampling method, called EvoSampling. During oversampling, genetic programming (GP) with multi-task learning is used to effectively and efficiently generate diverse, high-quality instances. During undersampling, we develop a granular-ball-based undersampling method that removes noise at multiple granularities, thereby enhancing data quality. Experiments on 20 imbalanced datasets demonstrate that EvoSampling improves the performance of various classification algorithms by providing better datasets than existing sampling methods. Moreover, ablation studies indicate that allowing knowledge transfer accelerates GP's evolutionary learning process.
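To make the hybrid-sampling pipeline concrete, the sketch below shows a generic (and deliberately simplified) two-stage procedure: a SMOTE-like oversampling step that interpolates between minority pairs, followed by an undersampling step that drops majority points whose nearest neighbour has the opposite label, as a crude noise proxy. This is NOT EvoSampling's GP-based oversampling or granular-ball undersampling; all function names here are illustrative.

```python
# A minimal hybrid-sampling sketch (NOT the paper's EvoSampling method):
# 1) oversample the minority class by linear interpolation between
#    random minority pairs (a SMOTE-like step);
# 2) undersample by removing majority points whose nearest neighbour
#    belongs to the minority class (a simple noise-removal proxy).
import random
import math

def interpolate(a, b, t):
    # Point on the segment between feature vectors a and b, t in [0, 1].
    return tuple(x + t * (y - x) for x, y in zip(a, b))

def oversample_minority(minority, n_new, rng):
    # Generate n_new synthetic minority instances by interpolating pairs.
    synth = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        synth.append(interpolate(a, b, rng.random()))
    return synth

def nearest_label(point, data):
    # Label of the nearest *other* point in data (list of (features, label)).
    best, best_d = None, math.inf
    for feats, lab in data:
        d = sum((p - q) ** 2 for p, q in zip(point, feats))
        if 0 < d < best_d:
            best, best_d = lab, d
    return best

def hybrid_sample(data, rng):
    # data: list of (features, label) with label 1 = minority, 0 = majority.
    minority = [f for f, lab in data if lab == 1]
    majority = [f for f, lab in data if lab == 0]
    synth = oversample_minority(minority, len(majority) - len(minority), rng)
    pool = data + [(s, 1) for s in synth]
    # Undersampling: keep a majority point only if its nearest neighbour
    # is also majority; keep all (real and synthetic) minority points.
    return [(f, lab) for f, lab in pool
            if lab == 1 or nearest_label(f, pool) == 0]
```

The paper's contribution replaces both of these naive steps: GP with multi-task learning generates the synthetic instances, and granular balls decide which instances to discard at multiple granularities rather than via a single nearest-neighbour rule.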