This paper addresses a significant challenge in missing data imputation: identifying and leveraging the interdependencies among features to improve imputation for tabular data. We introduce a novel framework, the Bipartite and Complete Directed Graph Neural Network (BCGNN). In BCGNN, observations and features are treated as two distinct node types, and the observed feature values are converted into attributed edges linking them. The bipartite segment of the framework inductively learns embedding representations for nodes, making efficient use of the information carried by the attributed edges. In parallel, the complete directed graph segment captures and conveys the complex interdependencies among features. Compared with contemporary leading imputation methods, BCGNN consistently performs better, achieving an average 15% reduction in mean absolute error on feature imputation tasks under different missing mechanisms. Our extensive experiments confirm that an in-depth understanding of the interdependence structure substantially enhances the model's feature embedding ability. We also highlight the model's superior performance in label prediction tasks involving missing data and its strong ability to generalize to unseen data points.
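The graph construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_bipartite_edges` and `build_complete_directed_edges` are hypothetical, missing entries are assumed to be encoded as NaN, and the complete directed graph is assumed to exclude self-loops, a detail the abstract does not specify.

```python
import numpy as np


def build_bipartite_edges(X):
    """Return (obs_idx, feat_idx, values) for every observed cell of X.

    Each observed entry X[i, j] becomes an edge between observation
    node i and feature node j, with the cell value as the edge attribute.
    Assumption: missing entries are encoded as NaN.
    """
    observed = ~np.isnan(X)
    obs_idx, feat_idx = np.nonzero(observed)
    values = X[obs_idx, feat_idx]
    return obs_idx, feat_idx, values


def build_complete_directed_edges(n_features):
    """Return (src, dst) for every ordered pair of distinct features.

    Assumption: self-loops are excluded, giving n * (n - 1) directed edges.
    """
    src, dst = np.nonzero(~np.eye(n_features, dtype=bool))
    return src, dst


# Toy table: 2 observations, 3 features, 2 missing cells.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])

obs_idx, feat_idx, values = build_bipartite_edges(X)
src, dst = build_complete_directed_edges(X.shape[1])

for i, j, v in zip(obs_idx, feat_idx, values):
    print(f"observation {i} -- feature {j}: value {v}")
print("directed feature edges:", list(zip(src.tolist(), dst.tolist())))
```

In this sketch, imputing a missing cell would correspond to predicting the attribute of an absent observation-feature edge. The message-passing layers that learn the embeddings are outside the scope of the example.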