Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide theoretical background that makes the assumptions on the missingness process explicit, and we analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with specialized architectures across diverse datasets and missingness regimes.
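The intuition behind claim (a) can be illustrated with a small numerical sketch. This is not the paper's proof or its evaluation protocol: the feature matrices, the 1% density, the 50% missing rate, the zero-imputation step, and the helper names `mcar_mask` and `changed_fraction` are all assumptions chosen for illustration. Under MCAR masking followed by zero imputation, an entry only changes if it was nonzero to begin with, so a sparse matrix is barely altered while a dense one loses roughly the masked fraction of its entries.

```python
import numpy as np


def mcar_mask(shape, rate, rng):
    """Return a boolean mask where True marks a missing entry.

    Each entry is dropped independently with probability `rate`,
    regardless of its value (the MCAR assumption).
    """
    return rng.random(shape) < rate


def changed_fraction(x, mask):
    """Fraction of entries whose value differs after zero-imputing masked entries."""
    x_imputed = np.where(mask, 0.0, x)
    return float(np.mean(x_imputed != x))


rng = np.random.default_rng(0)
n_nodes, n_feats, rate = 2000, 500, 0.5  # hypothetical sizes and missing rate

# Hypothetical sparse features: binary, roughly 1% nonzero.
sparse_x = (rng.random((n_nodes, n_feats)) < 0.01).astype(float)
# Hypothetical dense features: every entry carries a real value.
dense_x = rng.normal(size=(n_nodes, n_feats))

# Apply the same MCAR mask to both matrices.
mask = mcar_mask(sparse_x.shape, rate, rng)

sparse_changed = changed_fraction(sparse_x, mask)
dense_changed = changed_fraction(dense_x, mask)

print(f"sparse features: {sparse_changed:.4f} of entries altered")
print(f"dense features:  {dense_changed:.4f} of entries altered")
```

In expectation the altered fraction is about the missing rate times the feature density, which is why the sparse matrix stays nearly intact in this toy setting while the dense one does not.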