Training a generative model requires measuring the discrepancy between two high-dimensional probability distributions: the generative distribution and the ground-truth distribution of the observed dataset. Approaches that slice high-dimensional distributions have recently attracted growing interest, and the Cramer-Wold distance has emerged as a promising method among them. However, we find that the Cramer-Wold distance primarily focuses on joint distributional learning, whereas capturing marginal distributional patterns is crucial for effective synthetic data generation. In this paper, we introduce a novel measure of dissimilarity, the mixture Cramer-Wold distance. Because it incorporates a mixture measure with point masses on the standard basis vectors, this measure captures marginal and joint distributional information simultaneously. Building on the mixture Cramer-Wold distance, we propose a new generative model called CWDAE (Cramer-Wold Distributional AutoEncoder), which shows remarkable performance in generating synthetic data when applied to real tabular datasets. Furthermore, our model makes it easy to adjust the level of data privacy.
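The idea of mixing random slicing directions with point masses on the standard basis vectors can be illustrated with a minimal sketch. This is not the paper's definition of the mixture Cramer-Wold distance: the one-dimensional discrepancy used here (a squared difference of sorted projections, i.e. an empirical squared Wasserstein-2 distance in 1D) is a stand-in chosen for brevity, and the function names, the `weight` parameter, and the number of random directions are all hypothetical.

```python
import numpy as np


def sliced_1d_discrepancy(x, y, directions):
    """Mean squared gap between sorted 1D projections of x and y.

    Stand-in 1D discrepancy (empirical squared Wasserstein-2), NOT the
    kernel-smoothed discrepancy of the Cramer-Wold distance.
    x, y: arrays of shape (n, d) with the same n; directions: shape (k, d).
    """
    px = np.sort(x @ directions.T, axis=0)  # (n, k) sorted projections of x
    py = np.sort(y @ directions.T, axis=0)  # (n, k) sorted projections of y
    return float(np.mean((px - py) ** 2))


def mixture_sliced_discrepancy(x, y, weight=0.5, n_random=64, seed=0):
    """Blend a joint (random-direction) term with a marginal (basis) term.

    `weight` is a hypothetical mixing parameter: 0 uses only random unit
    directions, 1 uses only the standard basis vectors e_1, ..., e_d.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Random directions, normalized to lie on the unit sphere: joint structure.
    random_dirs = rng.normal(size=(n_random, d))
    random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

    # Standard basis vectors: projecting onto e_j recovers the j-th marginal.
    basis_dirs = np.eye(d)

    joint = sliced_1d_discrepancy(x, y, random_dirs)
    marginal = sliced_1d_discrepancy(x, y, basis_dirs)
    return (1.0 - weight) * joint + weight * marginal
```

The sketch assumes both samples contain the same number of rows so that sorted projections can be compared element-wise. Projecting onto a standard basis vector simply selects one column, which is why the basis term compares each marginal distribution directly.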