Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, because performance varies widely across real-world domains and hyperparameter tuning is time-consuming, no default imputation method has emerged. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations, requiring no fitting or hyperparameter tuning at inference time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training-data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluating imputation methods, comprising $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.
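To make the notion of "missingness patterns" concrete: the simplest pattern, MCAR (missing completely at random), masks each entry independently with a fixed probability. The sketch below is a generic illustration of such masking, not the paper's actual training-data pipeline; the function name and table representation are assumptions for this example.

```python
import random

def mask_mcar(table, p, seed=0):
    """Apply MCAR masking: each entry is independently replaced by None
    with probability p. Generic illustration only, not TabImpute's
    actual synthetic-data pipeline."""
    rng = random.Random(seed)
    return [[None if rng.random() < p else v for v in row] for row in table]

data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
masked = mask_mcar(data, p=0.3)
```

More realistic patterns (e.g. MAR, where an entry's missingness depends on observed values in other columns) follow the same masking idea but condition the drop probability on the rest of the row.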