Score-based generative models, commonly referred to as diffusion models, have proven successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. CDTD is based on a novel combination of score matching and score interpolation to enforce a unified continuous noise distribution for both continuous and categorical features. We explicitly address the need to homogenize distinct data types through model-specific loss calibration and initialization schemes. To further address the high heterogeneity in mixed-type tabular data, we introduce adaptive feature- or type-specific noise schedules. These ensure balanced generative performance across features and optimize the allocation of model capacity across features and diffusion time. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models and captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality. Replication code is available at https://github.com/muellermarkus/cdtd.