Class imbalance arises in many classification problems. Because standard classifiers are generally designed to maximize overall accuracy, an imbalanced class distribution makes classification difficult, particularly when the minority classes carry higher misclassification costs. The Backblaze dataset, a widely used hard disk drive dataset, contains a small amount of failure data and a large amount of healthy-drive data, and therefore exhibits severe class imbalance. This paper provides a comprehensive overview of research on imbalanced data classification. The discussion is organized around three categories: data-level methods, algorithm-level methods, and hybrid methods. For each category, we summarize and analyze the problems addressed, the algorithmic ideas, and the strengths and weaknesses. We also discuss the open challenges of imbalanced data classification and strategies for addressing them. The overview is intended to help researchers choose an appropriate method for their needs.