Score-based generative models (diffusion models for short) have proven successful at generating text and image data across many domains. However, mixed-type tabular data has so far received insufficient attention within this model family. Existing research mainly combines different diffusion processes without explicitly accounting for the feature heterogeneity inherent to tabular data. In this paper, we combine score matching and score interpolation so that a single type of continuous noise distribution affects continuous and categorical features alike. Further, we investigate the impact of distinct noise schedules per feature or per data type. We allow for adaptive, learnable noise schedules to ensure optimally allocated model capacity and balanced generative capability. Results show that our model consistently outperforms state-of-the-art benchmark models and that accounting for heterogeneity within the noise schedule design boosts sample quality.
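To make the idea of a common continuous noise distribution with per-feature noise schedules more concrete, the following is a minimal illustrative sketch in NumPy, not the implementation used in this paper. It assumes that categorical features are represented by embedding vectors, that Gaussian noise is added to continuous values and embeddings alike, and that score interpolation means forming the denoised estimate as the probability-weighted average of class embeddings. The schedule form and all names (`feature_sigma`, `add_noise`, `interpolated_score`, `rho`, `sigma_min`, `sigma_max`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def feature_sigma(t, rho, sigma_min=0.01, sigma_max=10.0):
    """Hypothetical per-feature schedule: sigma_i(t) grows from sigma_min to
    sigma_max, with one shape parameter rho_i per feature (assumed form)."""
    t = np.asarray(t, dtype=float)[:, None]           # (batch, 1)
    rho = np.asarray(rho, dtype=float)[None, :]       # (1, n_features)
    return sigma_min + (sigma_max - sigma_min) * t ** rho


def add_noise(x_cont, cat_idx, embeddings, t, rho_cont, rho_cat, rng):
    """Add Gaussian noise to continuous values and to categorical embeddings,
    each feature using its own noise level sigma_i(t)."""
    sig_cont = feature_sigma(t, rho_cont)             # (batch, d_cont)
    sig_cat = feature_sigma(t, rho_cat)               # (batch, d_cat)
    x_cont_t = x_cont + sig_cont * rng.standard_normal(x_cont.shape)
    # Look up one embedding vector per categorical feature and stack them.
    emb = np.stack(
        [embeddings[j][cat_idx[:, j]] for j in range(cat_idx.shape[1])],
        axis=1,
    )                                                 # (batch, d_cat, emb_dim)
    emb_t = emb + sig_cat[:, :, None] * rng.standard_normal(emb.shape)
    return x_cont_t, emb_t, sig_cont, sig_cat


def interpolated_score(emb_t_j, probs_j, embedding_j, sigma_j):
    """Score estimate for one categorical feature: the denoised embedding is
    the average of class embeddings weighted by predicted probabilities."""
    x0_hat = probs_j @ embedding_j                    # (batch, emb_dim)
    return (x0_hat - emb_t_j) / sigma_j[:, None] ** 2


# Toy usage: 2 continuous features, 2 categorical features (3 and 2 classes).
rng = np.random.default_rng(0)
x_cont = rng.standard_normal((4, 2))
cat_idx = np.array([[0, 1], [2, 0], [1, 1], [0, 0]])
embeddings = [rng.standard_normal((3, 5)), rng.standard_normal((2, 5))]
t = np.array([0.0, 0.5, 1.0, 0.25])
rho_cont = np.array([1.0, 2.0])   # in an adaptive setting these would be learned
rho_cat = np.array([0.5, 3.0])
x_cont_t, emb_t, sig_cont, sig_cat = add_noise(
    x_cont, cat_idx, embeddings, t, rho_cont, rho_cat, rng
)
```

In this sketch the shape parameters are fixed constants; making them trainable would be one way to obtain adaptive schedules, but how the schedules are actually parameterized and learned is not specified by the text above.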