In this paper, we aim to generate synthetic data with high machine learning utility (MLu) for heterogeneous (mixed-type) tabular datasets. Because MLu depends on how accurately the conditional distributions are approximated, we focus on a synthetic data generation method based on conditional distribution estimation. We propose MaCoDE, a novel synthetic data generation method that redefines the multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. The method can estimate conditional densities for arbitrary combinations of target and conditioning variables. Furthermore, we demonstrate that it bridges the theoretical gap between distributional learning and MLM. To validate its effectiveness, we conduct synthetic data generation experiments on 10 real-world datasets. Given the analogy between predicting masked input tokens in MLM and imputing missing data, we also evaluate multiple imputation performance on incomplete datasets under various missing data mechanisms. Moreover, the proposed model allows the data privacy level to be adjusted without re-training.
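To make the idea of histogram-based conditional density estimation as classification concrete, the following is a minimal sketch and not the authors' implementation. It assumes a continuous column is discretized into equal-frequency bins, that the conditional class probabilities over those bins are available (here they are simply counted for a single binary conditioning variable, whereas MaCoDE would obtain them from a masked classifier), and that sampling draws a bin and then a uniform value inside it. All function names are hypothetical, and the `temperature` argument is only an assumed illustration of a post-hoc knob that changes sampling without re-training; the abstract does not specify the actual privacy mechanism.

```python
import numpy as np

def quantile_bin_edges(x, n_bins):
    """Edges of n_bins equal-frequency bins covering the observed range of x."""
    return np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))

def to_bin_index(x, edges):
    """Map each value to its bin label in {0, ..., n_bins - 1}."""
    idx = np.searchsorted(edges, x, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)

def histogram_density(x, edges, probs):
    """Piecewise-constant density: bin probability divided by bin width."""
    idx = to_bin_index(x, edges)
    return probs[idx] / np.diff(edges)[idx]

def sample_from_bins(edges, probs, size, rng, temperature=1.0):
    """Draw a bin from temperature-scaled probabilities, then a uniform value in it."""
    logits = np.log(probs + 1e-12) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    bins = rng.choice(len(p), size=size, p=p)
    return rng.uniform(edges[bins], edges[bins + 1])

# Toy data (assumption): a continuous x whose distribution shifts with a binary c.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=5000)
x = rng.normal(loc=3.0 * c, scale=1.0)

n_bins = 20
edges = quantile_bin_edges(x, n_bins)
labels = to_bin_index(x, edges)

# Stand-in for the classifier output: empirical class probabilities given c.
probs_c0 = np.bincount(labels[c == 0], minlength=n_bins) / np.sum(c == 0)
probs_c1 = np.bincount(labels[c == 1], minlength=n_bins) / np.sum(c == 1)

# Conditional density values and synthetic draws for each condition.
density_c1 = histogram_density(np.array([0.0, 3.0]), edges, probs_c1)
synthetic_c0 = sample_from_bins(edges, probs_c0, 1000, rng)
synthetic_c1 = sample_from_bins(edges, probs_c1, 1000, rng)
```

In this sketch, a temperature above 1 flattens the bin probabilities and a temperature below 1 sharpens them, so the sampling distribution can be changed after the probabilities have been estimated.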