Precisely learning the probability distribution of rows in tabular data and producing authentic synthetic samples is both crucial and non-trivial. The Wasserstein generative adversarial network (WGAN) marks a notable improvement in generative modeling, addressing challenges faced by its predecessor, the generative adversarial network (GAN). However, WGAN often fails to produce high-fidelity samples because of the mixed data types and multimodality prevalent in tabular data, the delicate equilibrium between the generator and the discriminator, and the inherent instability of the Wasserstein distance in high dimensions. To address these issues, we propose POTNet (Penalized Optimal Transport Network), a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can effectively model tabular data containing both categorical and continuous features, and it offers the flexibility to condition on a subset of features. We provide theoretical justifications motivating the MPW loss. We also empirically demonstrate the effectiveness of our proposed method on four different benchmarks across a variety of real-world and simulated datasets. Our proposed model achieves an orders-of-magnitude speedup during the sampling stage compared to state-of-the-art generative models for tabular data, thereby enabling efficient large-scale synthetic data generation.
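The abstract names the MPW loss but does not define it. As a rough illustration of what a marginal penalty could look like, the sketch below sums one-dimensional empirical Wasserstein-1 distances over the columns of a real batch and a generated batch. The function names `wasserstein_1d` and `marginal_penalty`, the `weight` parameter, and the equal-batch-size requirement are all assumptions made for this example, not the authors' formulation; the joint Wasserstein term that such a penalty would be added to is omitted.

```python
import numpy as np


def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples.

    For equal sample sizes, the optimal coupling in one dimension matches
    sorted values, so the distance is the mean absolute difference of the
    sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))


def marginal_penalty(real, fake, weight=1.0):
    """Hypothetical marginal penalty (an assumption, not the paper's loss):
    a weighted sum of per-column 1D Wasserstein distances between batches."""
    real = np.asarray(real, dtype=float)
    fake = np.asarray(fake, dtype=float)
    if real.shape != fake.shape:
        raise ValueError("this sketch assumes batches of identical shape")
    per_column = [
        wasserstein_1d(real[:, j], fake[:, j]) for j in range(real.shape[1])
    ]
    return weight * sum(per_column)
```

Because each column is sorted independently, the penalty depends only on the marginal distributions and ignores row order; it is zero whenever every generated column has the same empirical distribution as the corresponding real column, even if the joint distributions differ.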