Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

Distributed training of graph neural networks (GNNs) enables learning on massive graphs (e.g., social and e-commerce networks) that exceed the storage and computational capacity of a single machine. To reach performance comparable to centralized training, existing distributed frameworks focus on maximally recovering cross-instance node dependencies, either through communication across instances or through periodic fallback to centralized training; both create overhead and limit scalability. In this work, we present a simplified framework for distributed GNN training that does not rely on these costly operations and that improves scalability, convergence speed, and performance over state-of-the-art approaches. Specifically, our framework (1) assembles independent trainers, each of which asynchronously learns a local model on the locally available part of the training graph, and (2) conducts only periodic (time-based) model aggregation to synchronize the local models. Maximizing the recovery of cross-instance node dependencies has been considered the key to closing the performance gap between model aggregation and centralized training. Backed by our theoretical analysis, our framework instead partitions the training graph by randomized assignment of nodes or super-nodes (i.e., collections of original nodes), which improves data uniformity and minimizes the discrepancy of gradients and loss functions across instances. In our experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches, despite using less training data, achieve state-of-the-art performance and a 2.31x speedup compared to the fastest baseline, and show better robustness to trainer failures.
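
The two ingredients described above, randomized assignment of nodes or super-nodes to trainers and periodic model aggregation, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `random_partition` and `aggregate`, the `group_of` argument, dropping edges that cross partitions, and the plain element-wise mean are all assumptions made for illustration. The time-based scheduling of aggregation and the GNN training loop itself are omitted.

```python
import random


def random_partition(num_nodes, edges, num_trainers, group_of=None, seed=0):
    """Randomly assign nodes (or super-nodes) to trainers.

    group_of[v] is the super-node id of node v (hypothetical input); when it
    is None, every node is its own group, i.e. plain per-node assignment.
    Assumption: each trainer keeps only edges whose endpoints it both owns.
    """
    rng = random.Random(seed)
    if group_of is None:
        group_of = list(range(num_nodes))
    # One uniformly random trainer per group; all members follow their group.
    group_owner = {g: rng.randrange(num_trainers) for g in sorted(set(group_of))}
    owner = [group_owner[group_of[v]] for v in range(num_nodes)]
    local_edges = {t: [] for t in range(num_trainers)}
    for u, v in edges:
        if owner[u] == owner[v]:  # cross-partition edges are dropped
            local_edges[owner[u]].append((u, v))
    return owner, local_edges


def aggregate(local_models):
    """Element-wise mean of local models (assumed aggregation rule).

    Each model is a dict mapping a parameter name to a flat list of floats.
    """
    n = len(local_models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in local_models))]
        for name in local_models[0]
    }


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    owner, local_edges = random_partition(4, edges, num_trainers=2)
    print(owner, local_edges)
    print(aggregate([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]))
```

Passing a `group_of` list that maps several nodes to the same id gives the super-node variant: edges inside a super-node are always kept, because all of its members land on the same trainer.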
