Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction

Despite the recent success of Graph Neural Networks (GNNs), training them on large-scale graphs remains challenging because of neighbor explosion. Distributed computing is a promising remedy, since it can leverage abundant computing resources (e.g., GPUs). However, the node dependencies in graph data make it difficult to achieve high concurrency in distributed GNN training, which suffers from massive communication overhead. Historical value approximation is a promising class of distributed training techniques for addressing this problem. It uses an offline memory to cache historical information (e.g., node embeddings) as an affordable approximation of the exact values, and thereby achieves high concurrency. These benefits, however, come at the cost of using outdated training information, which leads to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that adaptively reduces embedding staleness. The key idea of SAT is to model the evolution of the GNN's embeddings as a temporal graph and to build a model on it that predicts future embeddings, which effectively alleviates the staleness of the cached historical embeddings. We propose an online algorithm that trains the embedding predictor and the distributed GNN alternately, and we further provide a convergence analysis. Empirically, we demonstrate that SAT effectively reduces embedding staleness and thus achieves better performance and faster convergence on multiple large-scale graph datasets.
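
To make the staleness idea concrete, the following is a minimal sketch and not the SAT method itself. It assumes a toy cache that keeps the last two pushed embeddings per node, and it substitutes plain linear extrapolation for the learned temporal-graph predictor described in the abstract. The class `HistoricalEmbeddingCache`, its methods, and the helper `true_embedding` are hypothetical names introduced only for illustration.

```python
import numpy as np


class HistoricalEmbeddingCache:
    """Toy historical-embedding cache (hypothetical, for illustration only).

    Keeps the two most recently pushed embeddings per node so that a
    future value can be extrapolated instead of reused as-is.
    """

    def __init__(self, num_nodes, dim):
        self.prev = np.zeros((num_nodes, dim))
        self.last = np.zeros((num_nodes, dim))
        self.prev_step = np.zeros(num_nodes, dtype=int)
        self.last_step = np.zeros(num_nodes, dtype=int)

    def push(self, node_ids, emb, step):
        # Shift the current entry into the "previous" slot, then store the new one.
        self.prev[node_ids] = self.last[node_ids]
        self.prev_step[node_ids] = self.last_step[node_ids]
        self.last[node_ids] = emb
        self.last_step[node_ids] = step

    def pull_stale(self, node_ids):
        # Plain historical value approximation: return the cached value unchanged.
        return self.last[node_ids]

    def pull_predicted(self, node_ids, step):
        # Stand-in predictor: linear extrapolation over the staleness gap.
        # SAT trains a predictor instead; this is only an assumption-laden proxy.
        dt = np.maximum(self.last_step[node_ids] - self.prev_step[node_ids], 1)
        velocity = (self.last[node_ids] - self.prev[node_ids]) / dt[:, None]
        staleness = step - self.last_step[node_ids]
        return self.last[node_ids] + velocity * staleness[:, None]


def true_embedding(step, base, drift):
    # Assumed ground truth: embeddings drift linearly with the training step.
    return base + step * drift


rng = np.random.default_rng(0)
num_nodes, dim = 4, 3
base = rng.normal(size=(num_nodes, dim))
drift = rng.normal(size=(num_nodes, dim))
nodes = np.arange(num_nodes)

cache = HistoricalEmbeddingCache(num_nodes, dim)
cache.push(nodes, true_embedding(1, base, drift), step=1)
cache.push(nodes, true_embedding(2, base, drift), step=2)

# At step 5 the cached values are three steps old.
current = true_embedding(5, base, drift)
stale_error = np.linalg.norm(cache.pull_stale(nodes) - current)
predicted_error = np.linalg.norm(cache.pull_predicted(nodes, step=5) - current)
print(f"stale error: {stale_error:.4f}, predicted error: {predicted_error:.2e}")
```

Under the linear-drift assumption the extrapolated embeddings match the current ones up to floating-point error, while the stale ones do not. Real embedding trajectories are not linear, which is presumably why the paper trains a predictor online rather than using a fixed rule like this one.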

