We present BatchGNN, a distributed CPU system that demonstrates techniques for efficiently training GNNs on terabyte-sized graphs. It reduces communication overhead with macrobatching, in which the subgraph sampling and feature fetching of multiple minibatches are batched into a single communication relay, reducing redundant feature fetches when input features are static. BatchGNN provides integrated graph partitioning and native GNN layer implementations to improve runtime, and it can cache aggregated input features to further reduce sampling overhead. BatchGNN achieves an average $3\times$ speedup over DistDGL on three GNN models trained on OGBN graphs, outperforms the runtimes reported by the distributed GPU systems $P^3$ and DistDGLv2, and scales to a terabyte-sized graph.
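To illustrate why macrobatching reduces redundant feature fetches, the following is a minimal single-process sketch, not BatchGNN's implementation. It assumes static input features, models the remote feature store as an in-memory dictionary, and counts communication rounds and transferred feature rows. All names (`fetch_features`, `per_minibatch_fetch`, `macrobatch_fetch`, `new_counter`) are hypothetical and introduced only for this example.

```python
from typing import Dict, List

FeatureStore = Dict[int, List[float]]


def new_counter() -> Dict[str, int]:
    """Create a counter tracking communication rounds and rows transferred."""
    return {"rounds": 0, "rows": 0}


def fetch_features(node_ids: List[int], store: FeatureStore,
                   counter: Dict[str, int]) -> FeatureStore:
    """Hypothetical stand-in for one remote fetch round (assumption: one
    call corresponds to one communication relay)."""
    counter["rounds"] += 1
    counter["rows"] += len(node_ids)
    return {n: store[n] for n in node_ids}


def per_minibatch_fetch(minibatches: List[List[int]], store: FeatureStore,
                        counter: Dict[str, int]) -> List[FeatureStore]:
    """Baseline: every minibatch fetches its own input features."""
    return [fetch_features(sorted(set(mb)), store, counter)
            for mb in minibatches]


def macrobatch_fetch(minibatches: List[List[int]], store: FeatureStore,
                     counter: Dict[str, int]) -> List[FeatureStore]:
    """Macrobatch: take the union of node IDs over all minibatches, fetch
    each node once in a single round, then hand every minibatch its slice.
    This is only valid while features are static, since a fetched row is
    reused by later minibatches."""
    unique_ids = sorted({n for mb in minibatches for n in mb})
    fetched = fetch_features(unique_ids, store, counter)
    return [{n: fetched[n] for n in set(mb)} for mb in minibatches]


# Toy data: 10 nodes with 1-dimensional features and three overlapping
# minibatches of sampled input nodes (values chosen for illustration only).
store = {i: [float(i)] for i in range(10)}
minibatches = [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 0, 1]]

baseline_counter = new_counter()
baseline = per_minibatch_fetch(minibatches, store, baseline_counter)

macro_counter = new_counter()
macro = macrobatch_fetch(minibatches, store, macro_counter)
```

In this toy setting both strategies deliver identical features to every minibatch, but the macrobatch uses one round and transfers each of the 6 distinct rows once, whereas the baseline uses three rounds and transfers 12 rows. The actual savings depend on how much the sampled subgraphs of the grouped minibatches overlap.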