Recent advances in generative neural network models have extended the development of data augmentation methods. However, augmentation methods based on modern generative models fail to achieve notable performance on class-imbalanced data compared with the conventional Synthetic Minority Oversampling Technique (SMOTE). We investigate this shortcoming of generative models for imbalanced classification and introduce a framework that enhances the SMOTE algorithm with Variational Autoencoders (VAEs). Our approach systematically quantifies the density of data points in a low-dimensional latent space learned by the VAE, simultaneously incorporating information on class labels and classification difficulty. Data points likely to degrade the augmentation are then systematically excluded, and neighboring observations are interpolated directly in the data space. Empirical studies on several imbalanced datasets show that this simple procedure improves on both the conventional SMOTE algorithm and deep generative models. We conclude that selecting minority data points and interpolating in the data space are beneficial for imbalanced classification problems with a relatively small number of data points.
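The pipeline described above — embed the data in a low-dimensional latent space, score minority-point density there, drop sparse (likely noisy or hard) minority points, then run SMOTE-style interpolation in the original data space — can be sketched as follows. This is a minimal illustration, not the paper's implementation: PCA stands in for a trained VAE encoder, the density score is a simple k-nearest-neighbor mean distance among minority points, and the 0.8 keep-quantile is an arbitrary placeholder threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy imbalanced data: 200 majority and 20 minority points in 10 dimensions.
X_maj = rng.normal(0.0, 1.0, size=(200, 10))
X_min = rng.normal(1.5, 1.0, size=(20, 10))

# Stand-in for the VAE encoder: a low-dimensional latent embedding.
latent = PCA(n_components=2).fit(np.vstack([X_maj, X_min]))
Z_min = latent.transform(X_min)

# Density score in latent space: mean distance to the k nearest minority
# neighbors (smaller value = denser region).
k = 5
nn_latent = NearestNeighbors(n_neighbors=k + 1).fit(Z_min)
dists, _ = nn_latent.kneighbors(Z_min)
density_score = dists[:, 1:].mean(axis=1)  # drop the self-distance column

# Exclude the sparsest minority points (assumed noisy / hard to augment).
keep = density_score <= np.quantile(density_score, 0.8)
X_keep = X_min[keep]

# SMOTE-style interpolation among the kept points, in the ORIGINAL data space.
nn_data = NearestNeighbors(n_neighbors=min(k, len(X_keep) - 1) + 1).fit(X_keep)
_, idx = nn_data.kneighbors(X_keep)
n_new = len(X_maj) - len(X_min)  # oversample the minority class up to balance
synth = []
for _ in range(n_new):
    i = rng.integers(len(X_keep))
    j = idx[i, rng.integers(1, idx.shape[1])]  # random neighbor, skipping self
    lam = rng.random()
    synth.append(X_keep[i] + lam * (X_keep[j] - X_keep[i]))
X_synth = np.asarray(synth)
print(X_synth.shape)  # (180, 10)
```

Replacing the PCA projection with the mean of a trained VAE encoder, and folding class labels and classification difficulty into the density score, would recover the structure of the proposed framework.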