Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a single GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach, distributed full-graph training, suffers from high communication overhead and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing and a performance model to predict the optimal 3D configuration of our parallel implementation, Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over the prior state of the art and reduces time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.