Training Graph Neural Networks (GNNs) at a large scale requires significant computational resources, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlapping data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE) its own minibatch to process, leading to duplicated computations and input data accesses across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work per seed vertex as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though the global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, obtained simply by increasing this dependency, without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems.
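The claim that work per seed vertex shrinks as the batch size grows can be illustrated with a toy experiment. The following is a minimal sketch under stated assumptions, not the sampler used in this work: it builds a uniformly random graph, runs a simplified layer-wise neighbor sampler, and reports the number of unique sampled vertices per seed. The function names, graph size, fanout, and layer count are all hypothetical choices made for illustration.

```python
import random


def make_random_graph(num_vertices, degree, rng):
    """Toy graph (assumption): each vertex gets `degree` neighbors drawn uniformly at random."""
    return [[rng.randrange(num_vertices) for _ in range(degree)]
            for _ in range(num_vertices)]


def sampled_subgraph_size(adj, seeds, fanout, num_layers, rng):
    """Count the unique vertices reached by simplified layer-wise neighbor sampling."""
    visited = set(seeds)
    for _ in range(num_layers):
        sampled = set()
        # Every vertex collected so far samples up to `fanout` of its neighbors.
        for v in sorted(visited):
            sampled.update(rng.sample(adj[v], min(fanout, len(adj[v]))))
        # Overlapping neighborhoods are deduplicated by the set union.
        visited |= sampled
    return len(visited)


def work_per_seed(batch_sizes, num_vertices=2000, degree=10,
                  fanout=5, num_layers=2, seed=0):
    """Return {batch size: unique sampled vertices divided by batch size}."""
    rng = random.Random(seed)
    adj = make_random_graph(num_vertices, degree, rng)
    result = {}
    for b in batch_sizes:
        seeds = rng.sample(range(num_vertices), b)
        size = sampled_subgraph_size(adj, seeds, fanout, num_layers, rng)
        result[b] = size / b
    return result


if __name__ == "__main__":
    for b, w in work_per_seed([1, 8, 64, 512]).items():
        print(f"batch size {b:4d}: {w:6.2f} sampled vertices per seed")
```

In this toy setting, larger batches touch fewer unique vertices per seed because the sampled neighborhoods of different seeds overlap and are deduplicated. This is the effect that makes a single large shared minibatch cheaper than several independent small ones with the same global batch size. Real graphs and samplers will show different magnitudes.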