Several distributed frameworks have been developed to scale Graph Neural Networks (GNNs) to billion-scale graphs. On several benchmarks, we observe that the graph partitions these frameworks generate have heterogeneous data distributions and class imbalance, which affects convergence and results in lower performance than centralized implementations. We address these challenges holistically and develop techniques that reduce training time and improve accuracy. We develop an Edge-Weighted partitioning technique that improves the micro-averaged F1 score (accuracy) by minimizing the total entropy. Furthermore, we add an asynchronous personalization phase that adapts each compute host's model to its local data distribution. We also design a class-balanced sampler that considerably speeds up convergence. We implemented our algorithms on the DistDGL framework and observed that our training techniques scale much better than the existing training approach. We achieved a 2-3x speedup in training time and a 4% average improvement in micro-F1 scores on 5 large graph benchmarks compared to the standard baselines.
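To make two of the quantities above concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not define "total entropy" or the sampler, so the sketch assumes that total entropy is the sum of the Shannon entropies of each partition's class-label distribution, and that class-balanced sampling draws nodes with probability inversely proportional to their class frequency. The function names `label_entropy`, `total_entropy`, and `class_balanced_weights` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the class distribution within one partition."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over classes, with p = c / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def total_entropy(partitions):
    """ASSUMED definition: sum of per-partition label entropies."""
    return sum(label_entropy(p) for p in partitions)


def class_balanced_weights(labels):
    """ASSUMED sampler: per-node probabilities inversely proportional to
    class frequency, so every class receives equal total probability mass."""
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    norm = sum(raw)
    return [w / norm for w in raw]


if __name__ == "__main__":
    mixed = [[0, 1], [0, 1]]   # each partition holds both classes
    pure = [[0, 0], [1, 1]]    # each partition holds a single class
    print(total_entropy(mixed), total_entropy(pure))
    print(class_balanced_weights([0, 0, 0, 1]))
```

Under these assumptions, a partitioning whose hosts each hold a single class has lower total entropy than one that mixes classes on every host, and the sampler gives the lone minority-class node in `[0, 0, 0, 1]` as much probability mass as the three majority-class nodes combined.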