Graph-structured data is ubiquitous in the real world, and Graph Neural Networks (GNNs) have become increasingly popular across many fields thanks to their ability to process such irregular data directly. However, GNN training becomes inefficient as graphs scale up. Although parallel training offers performance improvements, the added communication cost often offsets these gains. To address this, this paper introduces CaPGNN, a novel framework for parallel full-batch GNN training on a single server with multiple GPUs. First, observing that the number of remote vertices in a partition is often at least as large as the number of local vertices and that many of these remote vertices are duplicated, we propose a joint adaptive caching algorithm that leverages both CPU and GPU memory and integrates lightweight cache-update and prefetch techniques to effectively reduce redundant communication. Second, to account for the varying computational and communication capabilities of different GPUs, we propose a communication- and computation-aware heuristic graph partitioning algorithm inspired by graph sparsification. In addition, we implement a pipeline that overlaps computation with communication. Extensive experiments show that CaPGNN accelerates training by up to 18.98x and reduces communication costs by up to 99%, with negligible accuracy loss and, in some cases, even an accuracy gain. Finally, we extend CaPGNN to multi-machine, multi-GPU environments. The code is available at https://github.com/songxf1024/CaPGNN.
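To make the caching idea concrete, the sketch below shows a minimal two-tier feature cache in the spirit of the joint CPU/GPU caching described above: hot remote-vertex features stay resident on the GPU, warm ones sit in pinned CPU memory for fast asynchronous copies, and everything else counts as a miss that would require inter-GPU communication. All names here (`TwoTierFeatureCache`, `fetch`, the hot/warm split) are illustrative assumptions, not CaPGNN's actual implementation.

```python
import torch

class TwoTierFeatureCache:
    """Illustrative two-tier cache: GPU tier for hot remote vertices,
    pinned-CPU tier for warm ones; misses stand in for communication."""

    def __init__(self, features, gpu_ids, cpu_ids, device="cuda"):
        self.device = device
        # Hot tier: features kept resident in GPU memory.
        self.gpu_index = {int(v): i for i, v in enumerate(gpu_ids)}
        self.gpu_store = features[gpu_ids].to(device)
        # Warm tier: pinned CPU memory enables fast async H2D copies.
        self.cpu_index = {int(v): i for i, v in enumerate(cpu_ids)}
        warm = features[cpu_ids]
        self.cpu_store = warm.pin_memory() if torch.cuda.is_available() else warm

    def fetch(self, vertex_ids):
        """Return cached features plus the ids that missed both tiers."""
        hits, misses = [], []
        for v in vertex_ids:
            v = int(v)
            if v in self.gpu_index:
                hits.append(self.gpu_store[self.gpu_index[v]])
            elif v in self.cpu_index:
                # non_blocking copy is effective because cpu_store is pinned.
                hits.append(self.cpu_store[self.cpu_index[v]].to(
                    self.device, non_blocking=True))
            else:
                misses.append(v)  # would trigger a remote fetch in practice
        return (torch.stack(hits) if hits else torch.empty(0)), misses

# Toy usage: 10 vertices with 4-dim features; vertices 0-2 hot, 3-5 warm.
feats = torch.randn(10, 4)
dev = "cuda" if torch.cuda.is_available() else "cpu"
cache = TwoTierFeatureCache(feats, gpu_ids=[0, 1, 2], cpu_ids=[3, 4, 5], device=dev)
cached, misses = cache.fetch([1, 4, 7])
print(cached.shape, misses)  # -> torch.Size([2, 4]) [7]
```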
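The computation/communication overlap mentioned above follows a standard multi-stream pattern. The sketch below shows the generic idea with PyTorch CUDA streams, where a pinned host-to-device copy stands in for the inter-GPU transfer of remote-vertex features; it is a minimal illustration of the pattern, not CaPGNN's pipeline, and the function name and shapes are assumptions.

```python
import torch

def pipelined_step(local_x, remote_x_cpu, weight):
    """Compute on local features while prefetching remote ones."""
    comm_stream = torch.cuda.Stream()
    # Enqueue the transfer on a side stream (proxy for remote-feature
    # communication); requires remote_x_cpu to be in pinned memory.
    with torch.cuda.stream(comm_stream):
        remote_x = remote_x_cpu.to("cuda", non_blocking=True)
    # Meanwhile, the default stream computes on local vertices.
    local_out = local_x @ weight
    # Synchronize before consuming the transferred features.
    torch.cuda.current_stream().wait_stream(comm_stream)
    remote_out = remote_x @ weight
    return local_out, remote_out

if torch.cuda.is_available():
    w = torch.randn(16, 8, device="cuda")
    local = torch.randn(1024, 16, device="cuda")
    remote = torch.randn(4096, 16).pin_memory()
    print([t.shape for t in pipelined_step(local, remote, w)])
```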