Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for multi-label and clustered data. Beyond descriptive categorization, this review critically examines each method's underlying assumptions, operational mechanisms, and suitability for diverse data characteristics, including high dimensionality, mixed feature types, class overlap, and noise. Key findings demonstrate that no single method universally outperforms others; optimal selection depends critically on dataset characteristics, classifier choice, and evaluation metrics. The paper concludes by identifying emerging research directions, including self-supervised learning for imbalance, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for imbalanced deployment, and the adaptation of foundation models to skewed distributions, offering practical guidelines for practitioners and a roadmap for future methodological development.