Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixed-type features, which combine discrete states with an otherwise continuous distribution in a single feature, remains challenging. We advance the state of the art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the purely categorical features together with a coarse categorical representation of the numerical features. This information is then leveraged in the high-resolution flow matching model via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and thereby enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately; for example, the detection score increases by 40%.
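To make the "coarse categorical representation of numerical features" concrete, the following is a minimal sketch of one way a mixed-type column could be mapped to low-resolution codes. The function name `coarsen_mixed_feature`, the coding scheme, and the use of quantile bins for the continuous part are illustrative assumptions and are not taken from the paper; the abstract only states that the coarse representation accounts for discrete outcomes such as missing or inflated values.

```python
import numpy as np


def coarsen_mixed_feature(x, inflated_values=(0.0,), n_bins=4):
    """Map a mixed-type numerical column to coarse categorical codes.

    Hypothetical coding scheme (an assumption, not the paper's):
      0                -> missing (NaN)
      1 .. k           -> the k inflated values, in the order given
      k+1 .. k+n_bins  -> quantile bin of the remaining continuous part
    """
    x = np.asarray(x, dtype=float)
    codes = np.full(x.shape, -1, dtype=int)

    # Discrete state 1: missing values.
    missing = np.isnan(x)
    codes[missing] = 0

    # Discrete state 2: inflated values (e.g. a point mass at zero).
    discrete = missing.copy()
    for k, v in enumerate(inflated_values, start=1):
        hit = ~missing & (x == v)
        codes[hit] = k
        discrete |= hit

    # Remaining entries are treated as continuous and binned by quantiles
    # estimated from the continuous part only.
    cont = ~discrete
    if cont.any():
        qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior quantiles
        edges = np.quantile(x[cont], qs)
        offset = 1 + len(inflated_values)
        codes[cont] = offset + np.searchsorted(edges, x[cont], side="right")
    return codes


# Usage: one missing entry, a zero-inflated mass, and four continuous values.
column = [np.nan, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(coarsen_mixed_feature(column, inflated_values=(0.0,), n_bins=2).tolist())
# -> [0, 1, 1, 2, 2, 3, 3]
```

In a cascade of the kind described, codes like these would be generated first alongside the purely categorical features, and the high-resolution model would then only need to fill in continuous values for entries whose code indicates a continuous bin. How the paper actually conditions the flow matching model on the codes is not specified in the abstract.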