Class imbalance, also known as a long-tailed distribution, is a common problem in machine-learning classification tasks. When it occurs, minority-class samples are overwhelmed by majority-class samples, which poses a considerable challenge for data science. To address class imbalance, researchers have proposed many methods: some rebalance the data set (SMOTE), others refine the loss function (Focal Loss), and some have noted that the value of labels influences class-imbalanced learning (Yang and Xu. Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS 2020). However, no one has yet changed the way data labels are encoded. Today, the most prevalent label-encoding technique is one-hot encoding, owing to its good performance in general settings. It is not a good choice for imbalanced data, however, because the classifier treats majority and minority samples equally. In this paper, we propose the enhancement encoding technique, which is designed specifically for imbalanced classification. Enhancement encoding combines re-weighting and cost-sensitiveness, and can reflect the difference between hard and easy (or minority and majority) classes. To reduce the number of validation samples and the computation cost, we also replace the confusion matrix with a novel soft-confusion matrix, which works better with a small validation set. In the experiments, we evaluate enhancement encoding with three different types of loss. The results show that enhancement encoding effectively improves the performance of networks trained on imbalanced data, particularly on minority classes.
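The abstract does not define the soft-confusion matrix, so the following is only a minimal sketch of one plausible reading: instead of counting argmax predictions, each validation sample adds its full predicted probability vector to the row of its true class. The function names (`hard_confusion`, `soft_confusion`) and the toy data are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def hard_confusion(y_true, probs, num_classes):
    """Standard confusion matrix: count argmax predictions per true class."""
    cm = np.zeros((num_classes, num_classes))
    preds = probs.argmax(axis=1)
    np.add.at(cm, (y_true, preds), 1.0)
    return cm

def soft_confusion(y_true, probs, num_classes):
    """Assumed 'soft' variant: accumulate predicted probabilities per true class."""
    cm = np.zeros((num_classes, num_classes))
    np.add.at(cm, y_true, probs)  # row y_true[i] += probs[i]
    return cm

# Toy validation set: three majority samples (class 0), one minority sample (class 1).
y_true = np.array([0, 0, 0, 1])
probs = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.70, 0.30],
    [0.45, 0.55],  # minority sample, barely classified correctly
])

hard = hard_confusion(y_true, probs, num_classes=2)
soft = soft_confusion(y_true, probs, num_classes=2)
print(hard)
print(soft)
```

With a single minority sample, the hard matrix reports a perfect diagonal, while the soft matrix still records that 0.45 of that sample's probability mass went to the majority class. Under this reading, that is why a soft matrix could remain informative on a small validation set.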