Graph Neural Networks (GNNs) are indispensable for learning from graph-structured data, yet their rising computational costs, especially on massively connected graphs, pose significant challenges to execution performance. To tackle this, distributed-memory solutions, such as partitioning the graph to concurrently train multiple replicas of a GNN, are used in practice. However, approaches that require a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies, owing to irregularities in neighborhood minibatch sampling. This paper proposes practical trade-offs that reduce the sampling and communication overheads of representation learning on distributed graphs (using the popular GraphSAGE architecture) by developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework. The scheme demonstrates about 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer across various OGB datasets.
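To make the idea of a parameterized prefetch and eviction scheme concrete, the following is a minimal, hypothetical sketch (not the paper's actual DistDGL implementation): a bounded LRU cache for remote node features whose capacity and prefetch set are the tunable parameters. The class name, `fetch_fn` callback, and the use of plain Python in place of DistDGL's RPC-based feature fetching are all assumptions for illustration.

```python
from collections import OrderedDict

class PrefetchEvictCache:
    """Hypothetical sketch of a parameterized prefetch-and-eviction cache
    for remote node features; capacity is the main tuning knob."""

    def __init__(self, capacity, fetch_fn):
        self.capacity = capacity      # max number of cached feature entries
        self.fetch_fn = fetch_fn      # stand-in for a remote feature fetch (assumed API)
        self.store = OrderedDict()    # insertion order doubles as LRU order
        self.hits = 0
        self.misses = 0

    def get(self, node_id):
        """Return a node's features, serving from cache when possible."""
        if node_id in self.store:
            self.hits += 1
            self.store.move_to_end(node_id)  # mark entry as most recently used
            return self.store[node_id]
        self.misses += 1
        feat = self.fetch_fn(node_id)        # cache miss: pay the remote cost
        self._insert(node_id, feat)
        return feat

    def prefetch(self, node_ids):
        """Continuously pull features for nodes expected in upcoming minibatches."""
        for nid in node_ids:
            if nid not in self.store:
                self._insert(nid, self.fetch_fn(nid))

    def _insert(self, nid, feat):
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry
        self.store[nid] = feat
```

In a distributed setting, the intuition is that prefetching halo-node features ahead of sampling hides communication latency, while the eviction policy bounds memory so the cache can run alongside training.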