Addressing the Impact of Localized Training Data in Graph Neural Networks

Graph Neural Networks (GNNs) have achieved notable success in learning from graph-structured data, owing to their ability to capture intricate dependencies and relationships between nodes. They excel in applications such as semi-supervised node classification, link prediction, and graph generation. However, most state-of-the-art GNN models assume an in-distribution setting, in which training and test data follow the same distribution, and this assumption hinders their performance on real-world graphs with dynamic structures. In this article, we assess the impact of training GNNs on localized subsets of a graph. Such restricted training data may yield a model that performs well in the region it was trained on but fails to generalize and make accurate predictions for the entire graph. In graph-based semi-supervised learning (SSL), resource constraints often mean that the dataset is large but only a portion of it can be labeled, which degrades model performance. The same limitation affects tasks such as anomaly detection and spam detection, where labeling processes can be biased or influenced by human subjectivity. To tackle the challenges posed by localized training data, we treat the problem as an out-of-distribution (OOD) data issue and align the distribution of the training data, which represents a small labeled portion of the graph, with that of graph inference, which involves making predictions for the entire graph. We propose a regularization method that minimizes the distributional discrepancy between localized training data and graph inference, improving model performance on OOD data. Extensive tests on popular GNN models show significant performance improvements on three citation GNN benchmark datasets. The regularization approach enhances model adaptation and generalization, helping to overcome the challenges posed by OOD data.
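To make the idea of a distribution-aligning regularizer concrete, the following is a minimal sketch under stated assumptions, not the method of this paper. The abstract does not say which discrepancy measure is used; the sketch assumes a squared maximum mean discrepancy (MMD) with a Gaussian kernel, computed between embeddings of the labeled training nodes and embeddings of all nodes. The names `mmd2`, `regularized_loss`, and the weight `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=None):
    # Gaussian kernel matrix between rows of a and rows of b.
    # Assumption: the bandwidth defaults to 1 / embedding dimension.
    if gamma is None:
        gamma = 1.0 / a.shape[1]
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=None):
    # Biased estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def regularized_loss(task_loss, z_train, z_all, lam=0.1):
    # Hypothetical objective: supervised loss plus a weighted
    # discrepancy between training-node and whole-graph embeddings.
    return task_loss + lam * mmd2(z_train, z_all)

# Synthetic embeddings standing in for GNN outputs (illustrative only).
rng = np.random.default_rng(0)
z_all = rng.normal(size=(200, 8))                # all nodes
z_local = rng.normal(loc=1.5, size=(40, 8))      # localized, shifted sample
z_uniform = z_all[rng.choice(200, size=40, replace=False)]  # uniform sample

d_local = mmd2(z_local, z_all)
d_uniform = mmd2(z_uniform, z_all)
```

In this toy setup the localized sample should produce a larger discrepancy than the uniformly drawn one, so the penalty term grows as the training embeddings drift away from those of the full graph. In an actual training loop the same quantity would be computed on differentiable tensors so that it can be minimized jointly with the supervised loss.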
