Class imbalance (CI) in classification problems arises when one class has substantially fewer observations than the others. Ensemble learning combines multiple models to obtain a more robust model and has been widely used together with data augmentation methods to address class imbalance. Over the last decade, a number of strategies have been proposed to enhance ensemble learning and data augmentation, alongside newer methods such as generative adversarial networks (GANs). Combinations of these methods have been applied in many studies, but establishing how the combinations rank against one another requires a computational review. In this paper, we present such a review, evaluating data augmentation and ensemble learning methods on prominent benchmark CI problems. We propose a general framework that evaluates 10 data augmentation and 10 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combining data augmentation with ensemble learning can significantly improve classification performance on imbalanced datasets. Our study supports the development of novel models for handling imbalanced datasets.
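As a purely illustrative sketch of what pairing a data augmentation step with an ensemble can look like, the following standard-library Python example balances a toy one-dimensional dataset by random oversampling and then combines several simple classifiers by majority vote. It is not the framework evaluated in the paper: random oversampling stands in for the augmentation methods (which may be far more sophisticated, for example GAN-based), a nearest-centroid rule stands in for the base learner, and every function name here is hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(xs, ys, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(ys)
    target = max(counts.values())
    out_x, out_y = list(xs), list(ys)
    for cls, n in counts.items():
        pool = [x for x, y in zip(xs, ys) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def fit_centroids(xs, ys):
    """Toy base learner (an assumption for illustration): one mean value per class."""
    counts = Counter(ys)
    sums = {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
    return {cls: sums[cls] / counts[cls] for cls in counts}


def predict_one(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda cls: abs(x - centroids[cls]))


def fit_ensemble(xs, ys, n_models=5):
    """Train each ensemble member on a differently seeded oversampled copy of the data."""
    return [fit_centroids(*random_oversample(xs, ys, seed=s)) for s in range(n_models)]


def predict_ensemble(models, x):
    """Combine the members' predictions by majority vote."""
    votes = [predict_one(m, x) for m in models]
    return Counter(votes).most_common(1)[0][0]


# Toy imbalanced data: eight samples of class 0, two samples of class 1.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 2.0, 2.2]
ys = [0] * 8 + [1] * 2

balanced_x, balanced_y = random_oversample(xs, ys)
models = fit_ensemble(xs, ys)
print(imbalance_ratio(ys), imbalance_ratio(balanced_y))
print(predict_ensemble(models, 2.1), predict_ensemble(models, 0.1))
```

A computational comparison of the kind described above would replace the oversampling function and the base learner with each candidate method in turn and score every pairing with a metric suited to imbalanced data; the choice of metric and datasets here is left open because the abstract does not specify them.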