Score-based generative models, commonly referred to as diffusion models, have proven successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. CDTD is based on a novel combination of score matching and score interpolation to enforce a unified continuous noise distribution for both continuous and categorical features. We explicitly acknowledge the necessity of homogenizing distinct data types by relying on model-specific loss calibration and initialization schemes. To further address the high heterogeneity in mixed-type tabular data, we introduce adaptive feature- or type-specific noise schedules. These ensure balanced generative performance across features and optimize the allocation of model capacity across features and diffusion time. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models and captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality. Replication code is available at https://github.com/muellermarkus/cdtd.