Tabular data is one of the most ubiquitous data modalities, yet the literature on tabular generative foundation models lags far behind its text and vision counterparts. Building such a model is hard for three reasons: feature spaces are heterogeneous across tabular datasets, tables come with metadata (e.g. dataset descriptions and feature headers), and tables lack prior knowledge (e.g. a meaningful feature order). In this work we propose LaTable, a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better while requiring fewer samples. We also examine LaTable's poor zero-shot performance and what it may teach us about building generative tabular foundation models with better zero- and few-shot generation capabilities.