Score-based generative models (or diffusion models for short) have proven successful for generating text and image data. However, the adaptation of this model family to mixed-type tabular data has fallen short so far. In this paper, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. Specifically, we combine score matching and score interpolation to ensure a common continuous noise distribution for both continuous and categorical features. We counteract the high heterogeneity inherent to mixed-type data with distinct, adaptive noise schedules per feature or per data type. The learnable noise schedules ensure optimally allocated model capacity and balanced generative capability. We homogenize the data types further with model-specific loss calibration and initialization schemes tailored to mixed-type tabular data. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality.
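To make the idea of a common continuous noise distribution concrete, the following is a minimal sketch, not the CDTD implementation. It assumes that categorical values are mapped to embedding vectors, that both feature types are perturbed with Gaussian noise, and that the score for a categorical feature is obtained by interpolating embeddings with predicted class probabilities. The embedding values, the noise levels, and the helper names (`add_noise`, `toy_class_probs`, `interpolated_score`) are hypothetical, and the noise levels are fixed here rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise(x0, sigma, rng):
    """Perturb clean values with Gaussian noise of standard deviation sigma."""
    return x0 + sigma * rng.standard_normal(x0.shape)


def toy_class_probs(x_t, embeddings, sigma):
    """Stand-in for a trained classifier (assumption): the exact posterior
    over classes under a uniform class prior and Gaussian noise."""
    logits = -np.sum((embeddings - x_t) ** 2, axis=1) / (2 * sigma**2)
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def interpolated_score(x_t, probs, embeddings, sigma):
    """Score estimate (E[x0 | x_t] - x_t) / sigma^2, where the expectation
    is the probability-weighted average of the class embeddings."""
    x0_hat = probs @ embeddings
    return (x0_hat - x_t) / sigma**2


# Toy row: one continuous feature and one categorical feature with 3 classes.
x_cont = np.array([0.7])
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical
category = 1

# Distinct noise levels per feature (fixed here for illustration only).
sigma_cont, sigma_cat = 0.5, 2.0

# Both feature types receive the same kind of continuous Gaussian noise.
x_cont_t = add_noise(x_cont, sigma_cont, rng)
x_cat_t = add_noise(embeddings[category], sigma_cat, rng)

probs = toy_class_probs(x_cat_t, embeddings, sigma_cat)
score_cat = interpolated_score(x_cat_t, probs, embeddings, sigma_cat)
```

In this sketch the categorical feature lives in the same continuous space as the numerical one, so a single Gaussian noising process covers both, while the separate `sigma_cont` and `sigma_cat` stand in for the per-feature noise schedules described above.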