Despite the recent success of Graph Neural Networks (GNNs), training them on large-scale graphs remains challenging because of neighbor explosion. Distributed computing is a promising remedy, since it can leverage abundant computing resources (e.g., GPUs). However, the node dependencies in graph data make it difficult to achieve high concurrency in distributed GNN training, which suffers from massive communication overhead. Historical value approximation is a promising class of distributed training techniques that addresses this problem. It uses an offline memory to cache historical information (e.g., node embeddings) as an affordable approximation of the exact values, and thereby achieves high concurrency. This benefit comes at the cost of training on dated information, which leads to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that adaptively reduces embedding staleness. The key idea of SAT is to model the evolution of the GNN's embeddings as a temporal graph and to build a model on top of it that predicts future embeddings, which effectively alleviates the staleness of the cached historical embeddings. We propose an online algorithm that trains the embedding predictor and the distributed GNN alternately, and we provide a convergence analysis. Empirically, we demonstrate that SAT effectively reduces embedding staleness and thus achieves better performance and faster convergence on multiple large-scale graph datasets.
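To make the idea of historical value approximation and staleness concrete, the following is a minimal sketch, not the SAT implementation. It assumes a hypothetical `HistoricalEmbeddingCache` class that stores the last two embeddings pushed for each node. As a deliberately simple stand-in for SAT's temporal-graph embedding predictor, it uses linear extrapolation from those two snapshots. All names, the toy linear drift of the "true" embeddings, and the choice of extrapolation are assumptions made for illustration only.

```python
import numpy as np


class HistoricalEmbeddingCache:
    """Hypothetical cache holding the two most recent embeddings per node."""

    def __init__(self, num_nodes: int, dim: int) -> None:
        self.prev = np.zeros((num_nodes, dim))
        self.last = np.zeros((num_nodes, dim))
        self.prev_step = np.zeros(num_nodes, dtype=np.int64)
        self.last_step = np.zeros(num_nodes, dtype=np.int64)
        self.num_pushes = np.zeros(num_nodes, dtype=np.int64)

    def push(self, node_ids: np.ndarray, embeddings: np.ndarray, step: int) -> None:
        """Store fresh embeddings, keeping the previous snapshot for each node."""
        self.prev[node_ids] = self.last[node_ids]
        self.prev_step[node_ids] = self.last_step[node_ids]
        self.last[node_ids] = embeddings
        self.last_step[node_ids] = step
        self.num_pushes[node_ids] += 1

    def pull_stale(self, node_ids: np.ndarray) -> np.ndarray:
        """Plain historical value approximation: return the cached (stale) value."""
        return self.last[node_ids].copy()

    def pull_predicted(self, node_ids: np.ndarray, step: int) -> np.ndarray:
        """Extrapolate each cached embedding linearly to the requested step.

        This is an illustrative stand-in for a learned embedding predictor.
        Nodes with fewer than two snapshots fall back to the stale value.
        """
        gap = (self.last_step[node_ids] - self.prev_step[node_ids]).astype(np.float64)
        has_history = (self.num_pushes[node_ids] >= 2) & (gap > 0)
        safe_gap = np.where(has_history, gap, 1.0)  # avoid division by zero
        velocity = (self.last[node_ids] - self.prev[node_ids]) / safe_gap[:, None]
        velocity[~has_history] = 0.0
        horizon = (step - self.last_step[node_ids]).astype(np.float64)
        return self.last[node_ids] + velocity * horizon[:, None]


def demo():
    """Compare stale and extrapolated embeddings against a toy ground truth."""
    rng = np.random.default_rng(0)
    num_nodes, dim = 4, 3
    base = rng.normal(size=(num_nodes, dim))
    drift = rng.normal(size=(num_nodes, dim))

    def true_embedding(step: int) -> np.ndarray:
        # Toy assumption: embeddings drift linearly over training steps.
        return base + step * drift

    cache = HistoricalEmbeddingCache(num_nodes, dim)
    ids = np.arange(num_nodes)
    cache.push(ids, true_embedding(1), step=1)
    cache.push(ids, true_embedding(2), step=2)

    target = true_embedding(5)  # the cache is three steps out of date here
    stale_err = np.linalg.norm(cache.pull_stale(ids) - target)
    pred_err = np.linalg.norm(cache.pull_predicted(ids, step=5) - target)
    return stale_err, pred_err


if __name__ == "__main__":
    stale_err, pred_err = demo()
    print(f"staleness error (cached value):   {stale_err:.4f}")
    print(f"staleness error (extrapolated):   {pred_err:.4f}")
```

In this toy setting the embeddings drift exactly linearly, so extrapolation recovers the target up to rounding error while the cached value is off by three steps of drift. Real GNN embeddings do not evolve this simply, which is the motivation for a learned predictor such as the one the abstract describes.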