A Comprehensive Survey on Distributed Training of Graph Neural Networks

Graph neural networks (GNNs) have proven to be powerful models across a broad range of applications because of their effectiveness in learning over graphs. To scale GNN training to large and ever-growing graphs, the most promising solution is distributed training, which spreads the training workload across multiple computing nodes. Research on distributed GNN training is now extensive and is being published at a rapid pace, and the approaches reported in these studies diverge significantly. This makes it difficult for newcomers to gain a comprehensive understanding of the workflows, computational patterns, communication strategies, and optimization techniques used in distributed GNN training. A survey that accurately identifies, analyzes, and compares the work in this field is therefore needed. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the optimization techniques it employs. First, we classify distributed GNN training into several categories according to workflow, and introduce the computational patterns, communication patterns, and optimization techniques proposed by recent work for each category. Second, we introduce the software frameworks and hardware platforms for distributed GNN training. Third, we compare distributed GNN training with distributed training of deep neural networks, emphasizing what is unique to the former. Finally, we discuss interesting issues and opportunities in this field.
