Given imbalanced data, it is hard to train a good classifier with deep learning because the model generalizes poorly on minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE), a data augmentation approach from the data mining literature on imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that achieves uneven margins more explicitly. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification while achieving superior performance on extremely imbalanced data. The code is open-sourced in our package, https://github.com/ntucllab/imbalanced-DL, to foster future research in this direction.
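The link between SMOTE-style interpolation and Mixup can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the code below is an assumption: it uses the standard Mixup rule (inputs and one-hot labels mixed with a coefficient drawn from a Beta distribution) and one plausible reading of soft SMOTE (classic SMOTE interpolation toward a neighbor, with the label interpolated by the same coefficient). The function names `mixup_pair` and `soft_smote_sample` are hypothetical and are not taken from the imbalanced-DL package, and the margin-aware variant is not shown because its definition is not stated here.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot label vector as a list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def interpolate(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]


def mixup_pair(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Standard Mixup: mix inputs and labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    return interpolate(x1, x2, lam), interpolate(y1, y2, lam), lam


def soft_smote_sample(x, y, neighbor_x, neighbor_y, rng=random):
    """SMOTE-style synthesis with a soft label (illustrative assumption).

    Classic SMOTE interpolates between a minority sample and one of its
    nearest neighbors with a uniform coefficient. Here the label vectors
    are interpolated with the same coefficient, so the synthetic point
    carries a soft label whenever the neighbor belongs to another class.
    Neighbor search is omitted; the caller supplies the neighbor.
    """
    lam = rng.random()  # uniform in [0, 1)
    return interpolate(x, neighbor_x, lam), interpolate(y, neighbor_y, lam), lam


if __name__ == "__main__":
    rng = random.Random(0)
    minority_x, minority_y = [0.0, 0.0], one_hot(1, 2)
    majority_x, majority_y = [1.0, 2.0], one_hot(0, 2)
    print(mixup_pair(minority_x, minority_y, majority_x, majority_y, rng=rng))
    print(soft_smote_sample(minority_x, minority_y, majority_x, majority_y, rng=rng))
```

Under these assumptions the two routines differ only in how the mixing coefficient is sampled and in which pairs are mixed, which is one way to see how both techniques could fit under a single framework.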