Missing data is a widespread problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks, but because each method's performance varies widely across real-world domains and hyperparameter tuning is time-consuming, no universal imputation method exists. This performance variance is particularly pronounced in small datasets, where models have the least information to work with. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations, requiring no fitting or hyperparameter tuning at inference time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, enabling a 100x speedup over the previous TabPFN imputation method, (ii) a synthetic training-data generation pipeline incorporating a diverse set of missingness patterns to improve accuracy on real-world missing-data problems, and (iii) MissBench, a comprehensive benchmark with 42 OpenML tables and 13 new missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance against numerous established imputation methods.
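To make the imputation setup concrete, the sketch below simulates the simplest missingness pattern, MCAR (missing completely at random), and applies the "simple averaging" baseline mentioned above. This is a minimal illustration, not the paper's method: the data, masking probability, and column-mean imputer are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complete table: 8 rows, 3 numeric columns (hypothetical data).
X = rng.normal(size=(8, 3))

# MCAR pattern: each entry is masked independently with probability p,
# regardless of the underlying values.
p = 0.3
mask = rng.random(X.shape) < p
X_miss = X.copy()
X_miss[mask] = np.nan

# Column-mean imputation, the "simple averaging" baseline.
col_means = np.nanmean(X_miss, axis=0)
X_imputed = np.where(np.isnan(X_miss), col_means, X_miss)

# Imputation error (RMSE) measured on the masked entries only,
# since the true values are known in this synthetic setting.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
```

Patterns where the mask depends on observed values (MAR) or on the missing values themselves (MNAR) require only a different `mask` construction, which is why a training pipeline can cover many patterns with one masking interface.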