Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices

While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test time have an outsized impact on performance for low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance across different degrees. Specifically, we explore three issues that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test leakage. The first two issues lead to poor generalizability to the test data, while the third leads to overestimation of the model's performance and directly impacts the deployment of GNNs. To address these issues in a systematic way, we introduce an effective and efficient GNN training framework, SpotTarget, which leverages our insight on low-degree nodes: (1) at training time, it excludes a (training) edge to be predicted if it is incident to at least one low-degree node; and (2) at test time, it excludes all test edges to be predicted (thus mimicking real scenarios of using GNNs, where the test data is not included in the graph). SpotTarget helps researchers and practitioners adhere to best practices for learning from graph data, which are frequently overlooked even by the most widely used frameworks. Our experiments on various real-world datasets show that SpotTarget makes GNNs up to 15x more accurate in sparse graphs, and significantly improves their performance for low-degree nodes in dense graphs.
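
The two exclusion rules described above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the `degree_threshold` parameter, the reading of "low-degree" as "degree strictly below the threshold", and the choice to compute degrees on the full input graph are all assumptions made for illustration, and the graph is assumed to be undirected with edges given as node pairs.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def spottarget_train_graph(graph_edges, target_edges, degree_threshold):
    """Return the message-passing edges for one training mini-batch.

    A target (to-be-predicted) edge is dropped only if at least one of
    its endpoints is low-degree. `degree_threshold` is a hypothetical
    hyperparameter; "low-degree" here means degree < degree_threshold.
    """
    deg = node_degrees(graph_edges)
    # frozenset makes (u, v) and (v, u) compare equal (undirected graph).
    targets = {frozenset(e) for e in target_edges}
    kept = []
    for u, v in graph_edges:
        is_target = frozenset((u, v)) in targets
        touches_low_degree = min(deg[u], deg[v]) < degree_threshold
        if is_target and touches_low_degree:
            continue  # rule (1): exclude this target edge
        kept.append((u, v))
    return kept


def spottarget_test_graph(graph_edges, test_edges):
    """Rule (2): remove every test target edge from the inference graph."""
    tests = {frozenset(e) for e in test_edges}
    return [(u, v) for u, v in graph_edges if frozenset((u, v)) not in tests]


# Toy graph: node 0 has degree 3, nodes 1-3 have degree 2, node 4 has degree 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]

# (0, 1) joins two nodes with degree >= 2, so it stays in the graph;
# (3, 4) touches node 4 (degree 1), so it is excluded.
train_graph = spottarget_train_graph(edges, [(0, 1), (3, 4)], degree_threshold=2)

# At test time the target edge is removed regardless of degree.
test_graph = spottarget_test_graph(edges, [(2, 1)])
```

In this sketch a target edge between two higher-degree nodes remains available for message passing during training, whereas at test time every target edge is removed, matching the deployment setting in which the edges to be predicted are not part of the graph.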
