Communication is a key bottleneck for distributed graph neural network (GNN) training. This paper proposes GNNPipe, a new approach that scales distributed full-graph deep GNN training. GNNPipe is the first to use layer-level model parallelism for GNN training: it partitions the GNN layers among GPUs, and each device performs the computation for a disjoint subset of consecutive GNN layers on the whole graph. Compared to graph parallelism, in which each GPU handles a graph partition, GNNPipe reduces the communication volume by a factor of the number of GNN layers. GNNPipe overcomes the unique challenges of pipelined layer-level model parallelism on the whole graph by partitioning the graph into dependent chunks, allowing the use of historical vertex embeddings, and applying specific training techniques to ensure convergence. We also propose a hybrid approach that combines GNNPipe with graph parallelism to handle large graphs, achieve better computing resource utilization, and ensure model convergence. We build a general GNN training system that supports all three parallelism settings. Extensive experiments show that, compared to graph parallelism, our method reduces the per-epoch training time by up to 2.45x (on average 1.58x) and reduces the communication volume and overhead by up to 22.89x and 27.21x (on average 8.69x and 11.60x), respectively, while achieving comparable model accuracy and convergence speed.
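To make the layer-level partitioning concrete, the following is a minimal Python sketch under stated assumptions, not the GNNPipe implementation. The helper names `partition_layers` and `pipeline_schedule` are hypothetical. The schedule shown is a generic forward pipeline over graph chunks; it does not model the chunk dependencies, historical vertex embeddings, or training techniques described above.

```python
def partition_layers(num_layers, num_gpus):
    """Split layer indices 0..num_layers-1 into consecutive, disjoint stages.

    Hypothetical helper: one stage per GPU, sizes as even as possible.
    """
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer.
        size = base + (1 if gpu < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def pipeline_schedule(num_chunks, num_stages):
    """Return, for each time step, the (stage, chunk) pairs active at that step.

    Assumption: a plain forward pipeline in which stage s processes
    chunk c at time step s + c.
    """
    steps = []
    for t in range(num_chunks + num_stages - 1):
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_chunks]
        steps.append(active)
    return steps


if __name__ == "__main__":
    # 8 GNN layers over 4 GPUs: each GPU owns 2 consecutive layers.
    print(partition_layers(8, 4))
    # 4 graph chunks flowing through 3 stages.
    for t, active in enumerate(pipeline_schedule(4, 3)):
        print(t, active)
```

In this sketch, each stage holds a contiguous block of layers and graph chunks flow through the stages in order, so several GPUs can be busy at the same time once the pipeline fills.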