Recent advances in generative neural network models have extended the development of data augmentation methods. However, augmentation methods based on modern generative models fail to outperform the conventional Synthetic Minority Oversampling Technique (SMOTE) on class-imbalanced data. We investigate this shortcoming of generative models for imbalanced classification and introduce a framework that enhances the SMOTE algorithm with a Variational Autoencoder (VAE). Our approach systematically quantifies the density of data points in a low-dimensional latent space learned by the VAE, while incorporating information on class labels and classification difficulty. Data points likely to degrade the augmentation are then systematically excluded, and neighboring observations are interpolated directly in the data space. Empirical studies on several imbalanced datasets show that this simple procedure improves on the conventional SMOTE algorithm and outperforms deep learning models. We conclude that selecting minority data points and interpolating in the data space are beneficial for imbalanced classification problems with relatively few data points.
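The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes latent codes for the minority class have already been produced by a trained VAE encoder, uses the k-NN radius in latent space as a simple stand-in for the density estimate, and omits the class-label and classification-difficulty weighting described in the abstract. The names `filtered_smote` and `density_quantile` are illustrative.

```python
import numpy as np

def knn_radius(z, k=5):
    # Distance from each latent point to its k-th nearest neighbor.
    # A larger radius means the point lies in a sparser (lower-density) region.
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero self-distance

def filtered_smote(x_min, z_min, n_new, k=5, density_quantile=0.8, rng=None):
    """Oversample minority samples x_min.

    1. Estimate latent density of each minority point via its k-NN radius
       over the latent codes z_min (proxy for the VAE-based density).
    2. Exclude the sparsest points, which would likely degrade augmentation.
    3. Interpolate SMOTE-style between remaining neighbors in data space.
    """
    rng = np.random.default_rng(rng)
    radius = knn_radius(z_min, k)
    keep = radius <= np.quantile(radius, density_quantile)  # drop sparse outliers
    x = x_min[keep]
    # Neighbor lists among the kept points, computed in the original data
    # space as in classic SMOTE.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    nn = d.argsort(axis=1)[:, 1:k + 1]
    base = rng.integers(0, len(x), size=n_new)
    nb = x[nn[base, rng.integers(0, k, size=n_new)]]
    lam = rng.random((n_new, 1))
    # Each synthetic sample is a convex combination of a point and a neighbor.
    return x[base] + lam * (nb - x[base])
```

In practice the density threshold and neighborhood size would be tuned per dataset; the key design choice mirrored here is that filtering happens in the latent space while interpolation happens in the data space.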