Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need to develop task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had a comparable impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel row-packing and attention scheme. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B achieves zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g., XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on an equal amount of data, or even up to 16x more. We release our model, code, and data along with the publication of this paper.