Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need to develop task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had a comparable impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel row-packing and attention scheme. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B achieves zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g., XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on an equal amount of data, or even up to 16x more. We release our model, code, and data along with the publication of this paper.