Class imbalance (CI) in classification problems arises when one class has substantially fewer observations than the others. Ensemble learning, which combines multiple models to obtain a more robust model, has been widely used together with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). Combinations of these have been applied in many studies, but establishing how the different combinations truly rank requires a computational review. In this paper, we present a computational review that evaluates data augmentation and ensemble learning methods on prominent benchmark CI problems. We propose a general framework that evaluates 10 data augmentation and 10 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combining data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. These findings have important implications for developing more effective approaches to handling imbalanced datasets in machine learning applications.
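The evaluation described in the abstract, crossing every data augmentation method with every ensemble method and scoring each pair, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the synthetic 1-D data, the threshold "stump" weak learner, the minority-class F1 metric, and all function names are hypothetical. Random oversampling and bagging stand in for the 10 augmentation and 10 ensemble methods, which the abstract does not list.

```python
import random
from collections import Counter
from itertools import product


def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


def make_data(n_major, n_minor, rng):
    """Synthetic 1-D data (assumption): class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    X = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    X += [rng.gauss(1.5, 1.0) for _ in range(n_minor)]
    return X, [0] * n_major + [1] * n_minor


def no_augmentation(X, y, rng):
    return list(X), list(y)


def random_oversample(X, y, rng):
    """Duplicate random samples of each smaller class until all classes match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def fit_stump(X, y):
    """Hypothetical weak learner: predict 1 when x > t, with t minimizing training errors."""
    candidates = [min(X) - 1.0] + sorted(set(X))
    best_t = min(
        candidates,
        key=lambda t: sum((x > t) != bool(lab) for x, lab in zip(X, y)),
    )
    return lambda x: int(x > best_t)


def single_model(X, y, rng):
    return fit_stump(X, y)


def bagging(X, y, rng, n_models=15):
    """Fit stumps on bootstrap resamples and combine them by majority vote."""
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: int(2 * sum(m(x) for m in models) > len(models))


def f1_minority(y_true, y_pred):
    """F1 score for class 1, treated here as the minority class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate_grid(augmenters, ensembles, train, test, seed=0):
    """Score every (augmentation, ensemble) pair on a held-out test set."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    scores = {}
    for (a_name, aug), (e_name, ens) in product(augmenters.items(), ensembles.items()):
        rng = random.Random(seed)  # same seed for every pair, for comparability
        X_aug, y_aug = aug(X_tr, y_tr, rng)
        model = ens(X_aug, y_aug, rng)
        scores[(a_name, e_name)] = f1_minority(y_te, [model(x) for x in X_te])
    return scores


AUGMENTERS = {"none": no_augmentation, "random_oversampling": random_oversample}
ENSEMBLES = {"single_stump": single_model, "bagging": bagging}

if __name__ == "__main__":
    train = make_data(200, 20, random.Random(1))
    test = make_data(200, 20, random.Random(2))
    print("training imbalance ratio:", imbalance_ratio(train[1]))
    for pair, score in sorted(evaluate_grid(AUGMENTERS, ENSEMBLES, train, test).items()):
        print(pair, round(score, 3))
```

Running the script prints one minority-class F1 score per combination. The scores on this toy data say nothing about the paper's findings; the sketch only shows the shape of such a comparison, where each augmenter and each ensemble shares a common signature so that new methods can be added to the two dictionaries without touching the evaluation loop.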