To analyze the scaling potential of deep tabular representation learning models, we introduce a novel Transformer-based architecture tailored to tabular data and cross-table representation learning. It combines table-specific tokenizers with a shared Transformer backbone. We train both single-table and cross-table models with a self-supervised masked cell recovery objective, that is, by imputing missing values. To understand the scaling behavior of our method, we train models of varying sizes, ranging from approximately $10^4$ to $10^7$ parameters, on a carefully curated pretraining dataset of 135M training tokens sourced from 76 diverse datasets. We assess scaling in both the single-table and cross-table pretraining setups by evaluating the pretrained models with linear probing on a curated set of benchmark datasets and comparing the results with conventional baselines.
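
As a rough illustration of the masked cell recovery objective, the sketch below shows only the data-preparation step: some cells of a table row are hidden, and their original values become the recovery targets. It is not the architecture or training code described here. The `MASK` placeholder, the `mask_cells` helper, and the masking probability are hypothetical choices made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden cell


def mask_cells(row, mask_prob=0.15, rng=None):
    """Hide a random subset of cells in one table row.

    Returns (masked_row, targets), where targets maps each masked
    column name to the original value the model should recover.
    The default mask_prob is an assumption, not a value from the text.
    """
    rng = rng or random.Random()
    masked, targets = {}, {}
    for col, value in row.items():
        if rng.random() < mask_prob:
            masked[col] = MASK       # the model sees only the placeholder
            targets[col] = value     # the original value is the training target
        else:
            masked[col] = value      # unmasked cells pass through unchanged
    return masked, targets


if __name__ == "__main__":
    row = {"age": 42, "city": "Oslo", "income": 51000.0}
    masked, targets = mask_cells(row, mask_prob=0.5, rng=random.Random(0))
    print(masked)
    print(targets)
```

In a full pipeline, the masked row would pass through a table-specific tokenizer and the shared backbone, and a loss would compare the predictions for the masked cells against `targets`.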