Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority-class examples, higher-capacity deep generative models have recently shown greater promise. However, handling class imbalance when building a deep generative model is also a challenging problem, and one that has not been studied as extensively as imbalanced classifier training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where the minority and majority distributions overlap. We show that this pre-processing of the training set alone significantly improves the quality of the generated data across several state-of-the-art diffusion- and GAN-based models. When training the classifier on synthetic data, we remove the overlap class from the training data and explain the reasons behind the improved accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method improves not only the synthesis quality of state-of-the-art generative models but also classifier performance.
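The binary-to-ternary relabeling step could be sketched as follows. The abstract does not specify how the overlap region between the minority and majority distributions is detected, so this illustration assumes a simple k-nearest-neighbour heuristic: a point whose neighbourhood contains both classes is reassigned to the overlap class before the generative model is trained. The function name `ternary_relabel` and the heuristic itself are illustrative, not the authors' actual procedure.

```python
import numpy as np

def ternary_relabel(X, y, k=5, overlap_label=2):
    """Relabel points lying in the class-overlap region with a third class.

    X : (n, d) array of tabular features.
    y : (n,) array of binary labels (0 = majority, 1 = minority).
    A point is moved to `overlap_label` if its k nearest neighbours
    contain examples of both classes (an assumed overlap criterion).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    y_new = y.copy()

    # Pairwise Euclidean distances; exclude self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)

    for i in range(len(X)):
        nbrs = np.argsort(d[i])[:k]          # indices of the k nearest neighbours
        frac_minority = y[nbrs].mean()       # fraction of minority neighbours
        if 0.0 < frac_minority < 1.0:        # mixed neighbourhood -> overlap region
            y_new[i] = overlap_label
    return y_new
```

The generative model would then be trained on `(X, y_new)` with three classes; before classifier training, rows synthesized with the overlap label are simply dropped, as described in the abstract.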