Class imbalance (CI) in classification problems arises when one class has fewer observations than the others. Ensemble learning combines multiple models to obtain a more robust model and has been widely used together with data augmentation methods to address class imbalance. Over the last decade, a number of strategies have been proposed to enhance ensemble learning and data augmentation methods, alongside new methods such as generative adversarial networks (GANs). Combinations of these methods have been applied in many studies, and a systematic evaluation of the different combinations would provide better understanding and guidance for different application domains. In this paper, we present a computational study that evaluates data augmentation and ensemble learning methods on prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combining data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) not only perform better on the selected CI problems, but are also computationally less expensive than GANs. These findings can guide the development of novel models for handling imbalanced datasets.
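To make the idea of pairing data augmentation with an ensemble concrete, the following is a minimal sketch rather than the framework evaluated in the paper. It uses only the Python standard library, and every name in it (`random_oversample`, `smote`, `bagged_1nn_predict`, the toy data) is hypothetical. The bagged 1-nearest-neighbour voter is an assumed stand-in for an ensemble method; the abstract does not list which 9 ensemble or 9 augmentation methods were used. In practice, libraries such as imbalanced-learn and scikit-learn provide tested implementations of these techniques.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, rng):
    """ROS: duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(list(rng.choice(pool)))
            y_out.append(label)
    return X_out, y_out


def smote(X, y, rng, k=3):
    """SMOTE-style: create points on the segment between a minority row and one of its k nearest same-class rows."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(pool) < 2:
            continue  # nothing to add, or no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(pool)
            neighbours = sorted(
                (p for p in pool if p is not base),
                key=lambda p: math.dist(p, base),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


def bagged_1nn_predict(X_train, y_train, x, rng, n_models=11):
    """Assumed ensemble: each member is a 1-NN classifier on a bootstrap sample; the majority vote wins."""
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        nearest = min(idx, key=lambda i: math.dist(X_train[i], x))
        votes.append(y_train[nearest])
    return Counter(votes).most_common(1)[0][0]


# Toy imbalanced data (hypothetical): 20 majority rows near the origin, 4 minority rows near (5, 5).
rng = random.Random(0)
X = [[(i % 5) * 0.5, (i // 5) * 0.5] for i in range(20)]
X += [[5.0, 5.0], [5.5, 5.2], [4.8, 5.6], [5.3, 4.7]]
y = [0] * 20 + [1] * 4

X_ros, y_ros = random_oversample(X, y, rng)
X_sm, y_sm = smote(X, y, rng)
pred = bagged_1nn_predict(X_sm, y_sm, [5.1, 5.1], rng)
```

Both functions balance the classes before training; the difference is that ROS repeats existing minority rows, while the SMOTE-style function synthesizes new rows between existing ones. Either balanced set can then be passed to whichever ensemble is being compared.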