Cross-partition edges dominate the cost of distributed GNN training: fetching remote features and activations every iteration overwhelms the network as models deepen and partition counts grow. Grappa is a distributed GNN training framework that enforces gradient-only communication: during each iteration, partitions train in isolation and exchange only gradients for the global update. To recover the accuracy lost to isolation, Grappa (i) periodically repartitions the graph to expose new neighborhoods and (ii) applies a lightweight coverage-corrected gradient aggregation inspired by importance sampling. We present an asymptotically unbiased estimator for gradient correction, which we use to develop a minimum-distance batch-level variant that is compatible with common deep-learning packages. We also introduce a shrinkage version that improves stability in practice. Empirical results on real and synthetic graphs show that Grappa trains GNNs 4x faster on average (up to 13x) than state-of-the-art systems, achieves better accuracy especially for deeper models, and sustains training at the trillion-edge scale on commodity hardware. Grappa is model-agnostic, supports full-graph and mini-batch training, and does not rely on high-bandwidth interconnects or caching.