Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been applied successfully in various domains. Despite their effectiveness, GNNs remain difficult to scale efficiently to large graphs. Distributed computing is a promising remedy for training large-scale GNNs, since it provides abundant computing resources. However, the dependencies imposed by the graph structure make efficient distributed GNN training difficult: training suffers from massive communication and workload imbalance. In recent years, considerable effort has gone into distributed GNN training, and an array of training algorithms and systems has been proposed. Yet a systematic review of the optimization techniques for the distributed execution of GNN training is still lacking. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We discuss the techniques in each category in detail. Finally, we summarize existing distributed GNN systems for multi-GPU, GPU-cluster, and CPU-cluster settings, respectively, and discuss future directions for distributed GNN training.