Cross-partition edges dominate the cost of distributed GNN training: as graphs deepen and partition counts grow, fetching remote features and activations every iteration overwhelms the network. Grappa is a distributed GNN training framework that enforces gradient-only communication: during each iteration, partitions train in isolation and exchange only gradients for the global update. To recover the accuracy lost to isolation, Grappa (i) periodically repartitions the graph to expose new neighborhoods and (ii) applies a lightweight coverage-corrected gradient aggregation inspired by importance sampling. We prove that the corrected estimator is asymptotically unbiased under standard support and boundedness assumptions, and we derive a batch-level variant, compatible with common deep-learning packages, that minimizes the mean-squared deviation from the ideal node-level correction. We also introduce a shrinkage variant that improves stability in practice. Empirical results on real and synthetic graphs show that Grappa trains GNNs 4 times faster on average (up to 13 times) than state-of-the-art systems, achieves higher accuracy, especially for deeper models, and sustains training at trillion-edge scale on commodity hardware. Grappa is model-agnostic, supports both full-graph and mini-batch training, and requires neither high-bandwidth interconnects nor caching.
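To make the coverage-corrected aggregation concrete, here is a minimal numpy sketch of an inverse-coverage gradient estimator and a shrinkage variant. All names (`coverage_corrected_mean`, `shrunk_weights`, `lam`) are illustrative assumptions, not Grappa's actual API: each node's gradient contribution is assumed to be observed with a known probability (its coverage), and reweighting by the inverse coverage makes the average unbiased in expectation, in the usual importance-sampling sense.

```python
import numpy as np

def coverage_corrected_mean(local_grads, coverage):
    """Average per-node gradients, reweighting each node by 1/coverage.

    local_grads: (n_nodes, dim) gradient contributions computed inside one
                 partition; rows are zero for nodes the partition never saw.
    coverage:    (n_nodes,) probability (> 0) that each node's contribution
                 is observed, e.g. how often repartitioning exposes it.
    """
    weights = 1.0 / coverage  # importance-sampling-style correction
    return (local_grads * weights[:, None]).mean(axis=0)

def shrunk_weights(coverage, lam=0.5):
    """Shrinkage variant: blend 1/coverage toward the uniform weight 1.

    lam=1 recovers the full correction; lam=0 ignores it. Intermediate
    values trade a little bias for lower variance, improving stability.
    """
    return lam / coverage + (1.0 - lam)
```

On a toy two-node example where node 1 is observed half the time, averaging the corrected estimate over both observation patterns recovers the plain mean of the full gradients, which is the unbiasedness property the abstract claims (asymptotically, under the stated assumptions).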