Graph Convolutional Networks (GCNs) are widely used across many domains. However, distributed full-batch training of GCNs on large-scale graphs is challenging due to inefficient memory access patterns and high communication overhead. This paper presents general and efficient aggregation operators designed for irregular memory access patterns. In addition, we propose a pre-post-aggregation approach and a quantization method with label propagation to reduce communication costs. Combining these techniques, we develop \emph{SuperGCN}, an efficient and scalable distributed GCN training framework for CPU-powered supercomputers. Experimental results on multiple large graph datasets show that our method achieves a speedup of up to 6$\times$ over state-of-the-art implementations and scales to thousands of HPC-grade CPUs without sacrificing model convergence or accuracy. Our framework delivers performance on CPU-powered supercomputers comparable to that of GPU-powered supercomputers, at a fraction of the cost and power budget.