We present \emph{TabRet}, a pre-trainable Transformer-based model for tabular data. TabRet is designed to work on downstream tasks that contain columns not seen during pre-training. Unlike other methods, TabRet adds a learning step before fine-tuning, called \emph{retokenizing}, which calibrates feature embeddings using a masked autoencoding loss. In experiments, we pre-trained TabRet on a large collection of public health surveys and fine-tuned it on classification tasks in healthcare, where it achieved the best AUC on four datasets. In addition, an ablation study shows that retokenizing and random shuffle augmentation of columns during pre-training both contributed to the performance gains. The code is available at https://github.com/pfnet-research/tabret.
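As a rough illustration of two ingredients named above, the following minimal sketch shows how the columns of one table row might be shuffled and then partially masked to form a masked-autoencoding input. It is not the authors' implementation: the helper names `shuffle_columns` and `mask_columns`, the example row, and the mask ratio of 0.5 are all assumptions made for illustration only.

```python
import random


def shuffle_columns(row, rng):
    """Return the row's (column, value) pairs in a random order.

    Hypothetical helper: illustrates column-shuffle augmentation only.
    """
    items = list(row.items())
    rng.shuffle(items)
    return items


def mask_columns(items, mask_ratio, rng):
    """Split pairs into visible ones and masked ones.

    The masked pairs would serve as reconstruction targets for a
    masked-autoencoding loss. The mask ratio is an assumed parameter.
    """
    n_mask = max(1, int(len(items) * mask_ratio))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    visible = [p for i, p in enumerate(items) if i not in masked_idx]
    masked = [p for i, p in enumerate(items) if i in masked_idx]
    return visible, masked


rng = random.Random(0)  # fixed seed so the sketch is reproducible
row = {"age": 52, "bmi": 27.4, "smoker": "no", "sex": "F"}  # made-up example row

tokens = shuffle_columns(row, rng)
visible, masked = mask_columns(tokens, mask_ratio=0.5, rng=rng)
```

In this sketch a model would see only `visible` and be trained to reconstruct the values in `masked`. Presenting the columns in a different order on each pass is one plausible reading of the shuffle augmentation.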