Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixed-type features, which combine discrete states with an otherwise continuous distribution in a single feature, remains challenging. We advance the state of the art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the purely categorical features together with a coarse categorical representation of the numerical features. A high-resolution flow matching model then leverages this information via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and thereby enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately; for example, the detection score improves by 51.9%. Code is available at https://github.com/muellermarkus/tabcascade.
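To make the idea of a coarse categorical representation of a numerical feature concrete, the following is a minimal sketch and not the authors' implementation. It assumes quantile binning for the continuous part and a user-supplied list of inflated values. The function name `coarsen_numeric` and its parameters `inflated_values` and `n_bins` are hypothetical, and the discretization in the linked repository may differ.

```python
import numpy as np

def coarsen_numeric(x, inflated_values=(0.0,), n_bins=4):
    """Map a mixed-type numeric column to coarse categorical codes.

    Hypothetical encoding (an assumption, not taken from the paper):
    code 0 marks missing values, codes 1..k mark the k inflated values,
    and the remaining codes index quantile bins of the continuous part.
    """
    x = np.asarray(x, dtype=float)
    codes = np.empty(x.shape, dtype=int)

    # Discrete state 1: missing values.
    missing = np.isnan(x)
    codes[missing] = 0

    # Discrete states 2..k+1: inflated values such as an excess of zeros.
    discrete = missing.copy()
    for k, value in enumerate(inflated_values, start=1):
        hit = ~missing & (x == value)
        codes[hit] = k
        discrete |= hit

    # Everything else is treated as continuous and binned by quantiles.
    continuous = ~discrete
    offset = 1 + len(inflated_values)
    if continuous.any():
        # Interior quantile edges, computed on the continuous part only.
        probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
        edges = np.quantile(x[continuous], probs)
        codes[continuous] = offset + np.searchsorted(
            edges, x[continuous], side="right"
        )
    return codes

# Usage: one missing entry, two inflated zeros, four continuous values.
column = [np.nan, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(coarsen_numeric(column, inflated_values=(0.0,), n_bins=2).tolist())
# → [0, 1, 1, 2, 2, 3, 3]
```

In this sketch the discrete outcomes get their own codes, so a low-resolution model over the codes can decide whether an entry is missing, inflated, or continuous before a high-resolution model fills in the exact value within the selected bin.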