Current distributed full-graph GNN training methods adopt a variant of data parallelism, namely graph parallelism, in which the whole graph is divided into multiple partitions (subgraphs) and each GPU processes one of them. This incurs high communication overhead because messages must be passed between partitions at every layer. To address this, we propose GNNPipe, a new training method that adopts model parallelism instead and has a lower worst-case asymptotic communication complexity than graph parallelism. To ensure high GPU utilization, we combine model parallelism with a chunk-based pipelined training method, in which each GPU concurrently processes a different chunk of graph data at different layers. We further propose hybrid parallelism, which combines model and graph parallelism when model-level parallelism alone is insufficient. We also introduce several techniques that preserve convergence speed and model accuracy despite the embedding staleness introduced by pipelining. Extensive experiments show that, compared to graph parallelism, our method reduces the per-epoch training time by up to 2.45x (2.03x on average) and reduces the communication volume and overhead by up to 22.51x and 27.21x (10.27x and 14.96x on average), respectively, while achieving comparable model accuracy and convergence speed.
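To illustrate the chunk-based pipelining idea, the following minimal sketch builds a generic forward-only pipeline schedule in which each stage (a GPU holding a group of layers) works on a different chunk at each time step. It is an assumption-laden illustration, not GNNPipe's actual scheduler: the function names `pipeline_schedule` and `utilization`, the rule that stage `s` handles chunk `t - s` at step `t`, and the uniform per-chunk cost are all hypothetical and chosen only to show why more chunks keep the GPUs busier.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int, num_chunks: int) -> List[List[Optional[int]]]:
    """Return schedule[t][s]: the chunk index stage s processes at step t.

    Hypothetical rule (an assumption, not taken from the paper): chunk c
    enters stage 0 at step c and advances one stage per step, so stage s
    handles chunk t - s at step t. None marks an idle stage.
    """
    num_steps = num_stages + num_chunks - 1
    schedule: List[List[Optional[int]]] = []
    for t in range(num_steps):
        row: List[Optional[int]] = []
        for s in range(num_stages):
            c = t - s
            row.append(c if 0 <= c < num_chunks else None)
        schedule.append(row)
    return schedule


def utilization(num_stages: int, num_chunks: int) -> float:
    """Fraction of (step, stage) slots that are busy under the schedule above."""
    busy_slots = num_stages * num_chunks
    total_slots = num_stages * (num_stages + num_chunks - 1)
    return busy_slots / total_slots


if __name__ == "__main__":
    # Print one row per time step; each column is one stage (GPU).
    for step, row in enumerate(pipeline_schedule(num_stages=3, num_chunks=4)):
        print(step, row)
    print(utilization(3, 4), utilization(3, 32))
```

Under these assumptions, the idle slots at the start and end of the pipeline are amortized as the number of chunks grows, which is the usual motivation for splitting the input into many chunks when layers are placed on different GPUs.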