Tabular data sets with missing values are typically prepared for machine learning using an arbitrarily chosen imputation strategy. The synthetic values generated by imputation models raise concerns about data quality and the reliability of data-driven outcomes. To address these concerns, this article proposes an imputation-free incremental attention learning (IFIAL) method for tabular data with missing values. A pair of attention masks is derived and retrofitted to a transformer so that it processes tabular data directly, without imputing or initializing missing values. The proposed method incrementally learns overlapping, fixed-size partitions of the feature set to enhance the transformer's performance. The average rank order of classification performance across 17 diverse tabular data sets shows that IFIAL outperforms 11 state-of-the-art learning methods, with or without missing-value imputation. Additional experiments corroborate the robustness of IFIAL to varying types and proportions of missing data and its advantage over methods that rely on explicit imputation. A feature partition size of one-half the original feature space yields the best trade-off between computational efficiency and predictive performance. IFIAL is among the first solutions enabling deep attention models to learn directly from tabular data without imputing missing values. The source code for this paper is publicly available.
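The core idea of masking attention so that missing features are simply never attended to, rather than filled in, can be illustrated with a minimal sketch. This is a toy single-head self-attention over feature tokens under assumed shapes and a single binary missingness mask; it is not the paper's exact pair of IFIAL masks or its incremental partition scheme.

```python
import numpy as np

def attention_with_missing_mask(tokens, observed):
    """Scaled dot-product self-attention in which missing feature
    tokens get zero attention weight, so no placeholder values leak
    into the output. Illustrative sketch only, not the IFIAL method.

    tokens:   (n, d) array, one embedded token per feature
    observed: (n,) boolean array, False where the feature is missing
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)   # (n, n) attention logits
    scores[:, ~observed] = -np.inf            # block attention TO missing tokens
    # Numerically stable softmax over the observed columns.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
n, d = 5, 4
tokens = rng.normal(size=(n, d))
observed = np.array([True, True, False, True, True])
tokens[~observed] = 0.0   # arbitrary placeholder; the mask keeps it unattended
out = attention_with_missing_mask(tokens, observed)
```

Because the masked column receives zero weight, the value stored at the missing position is irrelevant to the outputs of the observed tokens, which is the property that removes the need for imputation.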