CTTVAE: Latent Space Structuring for Conditional Tabular Data Generation on Imbalanced Datasets

Generating synthetic tabular data under severe class imbalance is essential for domains where rare but high-impact events drive decision-making. However, most generative models either overlook minority groups or fail to produce samples that are useful for downstream learning. We introduce CTTVAE, a Conditional Transformer-based Tabular Variational Autoencoder equipped with two complementary mechanisms: (i) a class-aware triplet margin loss that restructures the latent space for sharper intra-class compactness and inter-class separation, and (ii) a training-by-sampling strategy that adaptively increases exposure to underrepresented groups. Together, these components form CTTVAE+TBS, a framework that consistently yields more representative and utility-aligned samples without destabilizing training. Across six real-world benchmarks, CTTVAE+TBS achieves the strongest downstream utility on minority classes, often surpassing models trained on the original imbalanced data, while maintaining competitive fidelity and bridging the privacy gap between interpolation-based sampling methods and deep generative methods. Ablation studies further confirm that both latent structuring and targeted sampling contribute to these gains. By explicitly prioritizing downstream performance in rare categories, CTTVAE+TBS provides a robust and interpretable solution for conditional tabular data generation, with direct applicability to industries such as healthcare, fraud detection, and predictive maintenance, where even small gains in minority cases can be critical.
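The two mechanisms named above can be illustrated with a minimal sketch. The abstract does not give the exact formulation of either one, so the code below rests on assumptions: the triplet term is the standard margin form max(0, d(a, p) - d(a, n) + margin) with Euclidean distance, positives and negatives are chosen purely by class label, and training-by-sampling is approximated by drawing a class uniformly and then a row within it. All function names are hypothetical and are not taken from the CTTVAE implementation.

```python
import math
import random
from collections import defaultdict


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance (assumed form)."""
    d_ap = math.dist(anchor, positive)   # distance to a same-class latent
    d_an = math.dist(anchor, negative)   # distance to an other-class latent
    return max(0.0, d_ap - d_an + margin)


def group_by_class(labels):
    """Map each class label to the list of row indices carrying it."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    return by_class


def class_aware_triplets(labels, rng):
    """Build (anchor, positive, negative) index triplets from class labels.

    Assumption: the positive is any other row of the anchor's class and the
    negative is any row of a different class, both drawn uniformly.
    """
    by_class = group_by_class(labels)
    triplets = []
    for a, y in enumerate(labels):
        positives = [i for i in by_class[y] if i != a]
        negatives = [i for c, rows in by_class.items() if c != y for i in rows]
        if positives and negatives:
            triplets.append((a, rng.choice(positives), rng.choice(negatives)))
    return triplets


def balanced_batch_indices(labels, batch_size, rng):
    """Assumed training-by-sampling: pick a class uniformly, then a row in it."""
    by_class = group_by_class(labels)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(batch_size)]


def mean_triplet_loss(latents, labels, rng, margin=1.0):
    """Average the triplet loss over one class-aware triplet per anchor."""
    triplets = class_aware_triplets(labels, rng)
    if not triplets:
        return 0.0
    total = sum(
        triplet_margin_loss(latents[a], latents[p], latents[n], margin)
        for a, p, n in triplets
    )
    return total / len(triplets)
```

In a training loop, a term like `mean_triplet_loss` would be added to the usual VAE reconstruction and KL terms with some weight, while `balanced_batch_indices` would replace uniform row sampling when forming mini-batches. The weighting and the actual sampling schedule are not stated in the abstract and are left open here.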
