Data imputation is an effective way to handle missing data, which is common in practical applications. In this study, we propose and test a novel data imputation process that achieves two important goals: (1) it preserves the row-wise similarities among observations and the column-wise contextual relationships among features in the feature matrix, and (2) it tailors the imputation process to a specific downstream label prediction task. The proposed imputation process uses a Transformer network and graph structure learning to iteratively refine the contextual relationships among features and the similarities among observations. Moreover, it uses a meta-learning framework to select features that are influential for the downstream prediction task of interest. We conduct experiments on large real-world data sets and show that the proposed imputation process consistently improves imputation and label prediction performance over a variety of benchmark methods.