Graph neural networks are among the most successful machine learning models for relational datasets like metabolic, transportation, and social networks. Yet the determinants of their strong generalization for diverse interactions encoded in the data are not well understood. Methods from statistical learning theory do not explain emergent phenomena such as double descent or the dependence of risk on the nature of interactions. We use analytical tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model. The derived curves are phenomenologically rich: they explain the distinction between learning on homophilic and heterophilic graphs, and they predict double descent, whose existence in GNNs has been questioned by recent work. We show how risk depends on the interplay between the noise in the graph, the noise in the features, and the proportion of nodes used for training. Our analysis predicts the qualitative behavior not only of a stylized graph learning model but also of complex GNNs on messy real-world datasets. As a case in point, we use these analytic insights about heterophily and self-loop signs to improve the performance of state-of-the-art graph convolution networks on several heterophilic benchmarks by simply adding negative self-loop filters.
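To make the setting concrete, the following is a minimal sketch of a two-community contextual stochastic block model together with a one-step graph convolution that carries a tunable self-loop weight. It is an illustration under stated assumptions, not the construction analyzed in the paper: the parameter names (`p_in`, `p_out`, `mu`, `c`, `ridge`), the feature scaling, and the ridge-regression readout are all hypothetical choices made for the example.

```python
import numpy as np


def sample_csbm(n, d, p_in, p_out, mu, rng):
    """Sample a two-community contextual SBM (hypothetical parameterization).

    p_in / p_out: edge probability within / across communities.
    p_in < p_out gives a heterophilic graph, p_in > p_out a homophilic one.
    mu: strength of the label signal in the node features.
    """
    y = rng.choice([-1.0, 1.0], size=n)               # community labels
    same = np.equal.outer(y, y)                        # True where labels agree
    probs = np.where(same, p_in, p_out)                # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample each pair once
    A = (upper | upper.T).astype(float)                # symmetric, zero diagonal
    u = rng.standard_normal(d) / np.sqrt(d)            # hidden feature direction
    X = mu * np.outer(y, u) + rng.standard_normal((n, d)) / np.sqrt(d)
    return A, X, y


def graph_conv(A, X, c):
    """One-step graph convolution (A + c I) X; c < 0 is a negative self-loop."""
    return A @ X + c * X


def heldout_accuracy(A, X, y, c, train_frac, ridge, rng):
    """Fit ridge regression on a random subset of nodes, score the rest."""
    n = len(y)
    H = graph_conv(A, X, c) / np.sqrt(n)               # assumed normalization
    train = rng.random(n) < train_frac                 # random training mask
    Ht, yt = H[train], y[train]
    w = np.linalg.solve(Ht.T @ Ht + ridge * np.eye(H.shape[1]), Ht.T @ yt)
    pred = np.sign(H[~train] @ w)
    return float(np.mean(pred == y[~train]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heterophilic example: cross-community edges are more likely.
    A, X, y = sample_csbm(n=400, d=100, p_in=0.02, p_out=0.08, mu=1.0, rng=rng)
    for c in (-1.0, 0.0, 1.0):
        acc = heldout_accuracy(A, X, y, c, 0.5, 0.1, np.random.default_rng(1))
        print(f"self-loop c = {c:+.1f}: held-out accuracy = {acc:.3f}")
```

Sweeping `c`, `train_frac`, and the gap between `p_in` and `p_out` in this toy setup is one way to probe how the sign of the self-loop interacts with homophily and heterophily. The numbers it prints depend on the assumed parameters and should not be read as reproducing the paper's results.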