Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning

Imbalanced data are data in which one class (the majority) occurs more frequently than the other (the minority). This phenomenon occurs across various domains, such as security, medical care, and human activity. In imbalanced learning, classification algorithms tend to classify the majority class accurately, which yields artificially high accuracy. As a result, many minority samples are mislabelled as majority-class instances, a bias that favours the majority class. This study presents a framework based on boundary anchor samples to tackle the imbalanced learning problem. First, we select anchor samples and use them to train a multilayer perceptron (MLP) classifier, which acts as a prior knowledge model and supports the adversarial and contrastive learning procedures. Then, we design a novel deep generative model called the Anchor Stabilized Conditional Generative Adversarial Network, or Anch-SCGAN for short. Anch-SCGAN has two generators, one for the minority class and one for the majority class, and a discriminator that incorporates additional class-specific information from the pre-trained MLP feature extractor. In addition, we facilitate generator training in two ways. First, we define a new generator loss function based on reprocessed anchor samples and contrastive learning. Second, we apply a scoring strategy to stabilize the adversarial part of generator training. We train Anch-SCGAN and then fine-tune it with anchor samples to improve the precision of the generated samples. Our experiments on 16 real-world imbalanced datasets show that Anch-SCGAN outperforms well-known methods in imbalanced learning.
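
The claim that accuracy can be artificially high on imbalanced data can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of Anch-SCGAN: the 95/5 class split, the always-majority classifier, and the helper names `accuracy` and `recall` are assumptions chosen for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

majority_only_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy: {majority_only_accuracy:.2f}")      # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")      # minority recall: 0.00
```

Under these assumptions the classifier reaches 95% accuracy while detecting none of the minority samples, which is why imbalanced learning is usually evaluated with class-sensitive metrics rather than accuracy alone.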
