We propose a theoretical framework for training Graph Neural Networks (GNNs) on large input graphs by training on small, fixed-size sampled subgraphs. The framework applies to a wide range of models, including popular sampling-based GNNs such as GraphSAGE and FastGCN. Leveraging the theory of graph local limits, we prove that, under mild assumptions, the parameters learned by training sampling-based GNNs on small samples of a large input graph lie within an $\epsilon$-neighborhood of the parameters obtained by training the same architecture on the whole graph. We derive bounds, as functions of $\epsilon$, on the number of samples, the size of the graph, and the number of training steps required. Our results offer a new theoretical understanding of the use of sampling in training GNNs. They also suggest that, by training GNNs on small samples of the input graph, practitioners can identify and select the best models, hyperparameters, and sampling algorithms more efficiently. We illustrate our results empirically on a node classification task on large citation graphs, observing that sampling-based GNNs trained on local subgraphs 12$\times$ smaller than the original graph achieve performance comparable to that of GNNs trained on the full input graph.
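As a rough illustration of what training on small, fixed-size local subgraphs might involve, the sketch below draws breadth-first neighborhoods around random root nodes and truncates each to a fixed node budget. The abstract does not specify the sampler, so the breadth-first rule, the adjacency-dictionary format, and the names `sample_local_subgraph`, `induced_edges`, and `sample_training_batches` are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import deque


def sample_local_subgraph(adj, root, max_nodes):
    """Return up to max_nodes nodes reached by breadth-first search from root.

    adj is a dict mapping each node to a list of neighbors (assumed format).
    """
    visited = [root]
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                visited.append(nbr)
                queue.append(nbr)
                if len(visited) == max_nodes:
                    break
    return visited


def induced_edges(adj, nodes):
    """Edges of the subgraph induced by nodes, each undirected edge listed once."""
    keep = set(nodes)
    return [(u, v) for u in nodes for v in adj[u] if v in keep and u < v]


def sample_training_batches(adj, num_samples, max_nodes, seed=0):
    """Draw num_samples local subgraphs around uniformly random roots."""
    rng = random.Random(seed)
    roots = rng.choices(list(adj), k=num_samples)
    return [sample_local_subgraph(adj, r, max_nodes) for r in roots]
```

Each sampled node list, together with its induced edges, would then stand in for the full graph during one training step of whichever GNN architecture is being studied. The number of samples and the node budget are the quantities the abstract says are bounded as functions of $\epsilon$.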