Distributed full-graph training of Graph Neural Networks (GNNs) over large graphs is bandwidth-demanding and time-consuming. Frequent exchanges of node features, embeddings and embedding gradients (collectively referred to as messages) across devices incur significant communication overhead for nodes with remote neighbors on other devices (marginal nodes) and unnecessary waiting time for nodes without remote neighbors (central nodes) in the training graph. This paper proposes AdaQP, an efficient GNN training system that expedites distributed full-graph GNN training. We stochastically quantize the messages transferred across devices to lower-precision integers to reduce communication traffic, and we overlap the communication of marginal nodes with the computation of central nodes. We provide a theoretical analysis that proves fast training convergence (at the rate of O(T^{-1}), with T being the total number of training epochs) and, based on this analysis, design an adaptive scheme that assigns a quantization bit-width to each message, targeting a good trade-off between training convergence and efficiency. Extensive experiments on mainstream graph datasets show that AdaQP substantially improves the throughput of distributed full-graph training (by up to 3.01×) with a negligible accuracy drop (at most 0.30%) or even an accuracy improvement (up to 0.19%) in most cases, showing significant advantages over state-of-the-art works.
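To make the idea of stochastically quantizing a message to lower-precision integers concrete, the following is a minimal sketch of generic uniform quantization with stochastic rounding. It is an illustration under stated assumptions, not AdaQP's implementation: the abstract does not specify the quantizer, and the function names `stochastic_quantize` and `dequantize`, the per-message min/max scaling, and the pure-Python list representation are all hypothetical choices made here.

```python
import math
import random


def stochastic_quantize(values, bits, rng):
    """Map floats to integer codes in [0, 2**bits - 1] using stochastic rounding.

    Hypothetical helper for illustration; min/max scaling is an assumption.
    """
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant message, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = []
    for v in values:
        x = (v - lo) / scale  # position of v on the integer grid
        floor_x = math.floor(x)
        # Round up with probability equal to the fractional part, so the
        # expected dequantized value equals the original value.
        code = floor_x + (1 if rng.random() < x - floor_x else 0)
        codes.append(min(code, levels))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    message = [0.0, 0.3, 1.9, 3.75]
    codes, lo, scale = stochastic_quantize(message, bits=4, rng=rng)
    print(codes, dequantize(codes, lo, scale))
```

In this sketch the sender would transmit the integer codes together with `lo` and `scale`, and the receiver would call `dequantize`. A smaller `bits` value reduces traffic but increases the variance of the rounding error, which is the trade-off an adaptive bit-width assignment has to balance.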