The success of self-supervised learning in computer vision and natural language processing has motivated pretraining methods for tabular data. However, most existing tabular self-supervised learning models fail to leverage information across multiple data tables and cannot generalize to new tables. In this work, we introduce XTab, a framework for cross-table pretraining of tabular transformers on datasets from various domains. We address the challenge of inconsistent column types and quantities among tables by using independent featurizers and pretraining the shared component with federated learning. Tested on 84 tabular prediction tasks from the OpenML-AutoML Benchmark (AMLB), we show that (1) XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers, and (2) FT-Transformer pretrained via XTab outperforms other state-of-the-art tabular deep learning models on various tasks such as regression, binary classification, and multiclass classification.
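The split between independent featurizers and a federated shared component can be illustrated with a toy sketch. This is not the XTab implementation: it assumes one featurizer per table, replaces the transformer backbone with a single shared weight vector, uses a supervised squared-error objective on synthetic data, and uses plain federated averaging. All names (`TableClient`, `federated_round`, `D`) are hypothetical.

```python
import numpy as np

D = 8  # width of the shared representation (assumed for illustration)


class TableClient:
    """One table: a private featurizer plus synthetic data (toy stand-in)."""

    def __init__(self, n_cols, rng):
        self.X = rng.normal(size=(64, n_cols))  # toy numeric table
        self.y = self.X.sum(axis=1)             # toy regression target
        # Table-specific featurizer: one D-dim embedding per column, so
        # tables with different column counts map into the same space.
        self.E = rng.normal(scale=0.1, size=(n_cols, D))

    def loss(self, w):
        err = self.X @ self.E @ w - self.y
        return 0.5 * np.mean(err ** 2)

    def local_update(self, w_shared, lr=0.01, steps=5):
        """Train a local copy of the shared weights and the private featurizer."""
        w = w_shared.copy()
        n = len(self.y)
        for _ in range(steps):
            h = self.X @ self.E                  # (n, D) featurized rows
            err = h @ w - self.y                 # (n,) residuals
            grad_w = h.T @ err / n
            grad_E = self.X.T @ np.outer(err, w) / n
            w -= lr * grad_w                     # shared component (local copy)
            self.E -= lr * grad_E                # featurizer never leaves the client
        return w


def federated_round(clients, w_shared):
    """Federated averaging: only the shared weights are aggregated."""
    local_weights = [c.local_update(w_shared) for c in clients]
    return np.mean(local_weights, axis=0)


rng = np.random.default_rng(0)
clients = [TableClient(n_cols, rng) for n_cols in (3, 5, 10)]
w = np.zeros(D)

initial = sum(c.loss(w) for c in clients)
for _ in range(20):
    w = federated_round(clients, w)
final = sum(c.loss(w) for c in clients)
print(f"total loss: {initial:.3f} -> {final:.3f}")
```

The point of the sketch is the division of state: each table keeps a featurizer whose shape depends on its own columns, while the shared weights have a fixed shape and are the only thing averaged across tables.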