Tabular data is one of the most ubiquitous sources of information worldwide, spanning a wide variety of domains. This inherent heterogeneity has slowed the development of Tabular Foundation Models (TFMs) capable of fast generalization to unseen datasets. In-Context Learning (ICL) has recently emerged as a promising solution for TFMs, enabling dynamic adaptation to new tasks without additional tuning. While many studies have attempted to re-purpose large language models for tabular ICL, they have had limited success, so recent works have focused on developing tabular-specific foundation models. In this work, we propose an approach that combines ICL-based retrieval with self-supervised learning to train tabular foundation models. We also investigate the utility of real versus synthetic data for model pre-training, and show that real data can contain useful signal not easily captured in synthetic training. Specifically, we show that incorporating real data during the pre-training phase can lead to significantly faster training and better downstream generalization to unseen data. Our resulting model, TabDPT, achieves strong performance on both regression (CTR23) and classification (CC18) benchmarks. Importantly, we also demonstrate that with our pre-training procedure, scaling both model and data size leads to consistent performance improvements that follow power laws. This echoes scaling laws in LLMs and other foundation models, and suggests that large-scale TFMs are achievable. We open-source our full pipeline: inference code, including trained model weights, can be found at github.com/layer6ai-labs/TabDPT-inference, and the training code to reproduce experiments can be found at github.com/layer6ai-labs/TabDPT-training.
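To make the "ICL-based retrieval" idea concrete: rather than feeding an entire table as context, a retrieval step can select, for each query row, the most similar training rows to serve as in-context examples. The sketch below is a minimal, hypothetical illustration assuming retrieval means k-nearest-neighbour search in feature space; TabDPT's actual retrieval procedure is specified in the paper and training code, and the function name and parameters here are illustrative only.

```python
import math

def retrieve_context(x_query, X_train, y_train, k=32):
    """Hypothetical sketch: pick the k training rows nearest (Euclidean
    distance) to a query row, to serve as its in-context examples."""
    dists = [math.dist(x_query, x) for x in X_train]       # distance to each training row
    order = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]  # indices of k nearest
    return [X_train[i] for i in order], [y_train[i] for i in order]
```

In practice the retrieved (features, label) pairs would be concatenated with the query row and passed to the model, which predicts the query's label in a single forward pass with no gradient updates.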