Training and inference with graph neural networks (GNNs) on massive graphs have been actively studied since the inception of GNNs, owing to the widespread use and success of GNNs in applications such as recommendation systems and financial forensics. This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings, where the necessary partitioning of vertex features across distributed storage makes feature communication a major bottleneck that hampers scalability. To significantly reduce the communication volume without compromising prediction accuracy, we propose a policy for caching data associated with frequently accessed vertices in remote partitions. The proposed policy is based on an analysis of vertex-wise inclusion probabilities (VIP) during multi-hop neighborhood sampling, which may expand the neighborhood far beyond the partition boundaries of the graph. VIP analysis not only enables the elimination of the communication bottleneck, but also offers a means to organize in-memory data by prioritizing GPU storage for the most frequently accessed vertex features. We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data and leverages the VIP-driven caching policy. SALIENT++ retains the local training efficiency and scalability of SALIENT by using a deep pipeline and drastically reducing communication volume, while consuming only a fraction of the storage required by SALIENT. We provide experimental results with the Open Graph Benchmark data sets and demonstrate that training a 3-layer GraphSAGE model with SALIENT++ on 8 single-GPU machines is 7.1× faster than with SALIENT on 1 single-GPU machine, and 12.7× faster than with DistDGL on 8 single-GPU machines.
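To make the caching idea concrete, the following is a minimal sketch of frequency-driven caching of remote vertices. It is not the VIP analysis itself, whose details the abstract does not give. As a stand-in, it estimates how often each remote vertex appears in a sampled multi-hop neighborhood by simulating node-wise sampling, and then caches the most frequently accessed ones. All names here (`sample_neighborhood`, `estimate_remote_access`, `choose_cache`, `part_of`, `fanouts`) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def sample_neighborhood(adj, seeds, fanouts, rng):
    """Node-wise sampling: at each hop, keep at most `fanout` neighbors per vertex."""
    visited = set(seeds)
    frontier = set(seeds)
    for fanout in fanouts:
        next_frontier = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            if len(nbrs) > fanout:
                nbrs = rng.sample(nbrs, fanout)  # uniform, without replacement
            next_frontier.update(nbrs)
        visited |= next_frontier
        frontier = next_frontier
    return visited


def estimate_remote_access(adj, part_of, local_part, train_nodes,
                           fanouts, batch_size, epochs, seed=0):
    """Empirical per-minibatch access frequency of each remote vertex.

    Assumption: this Monte Carlo estimate stands in for an analytical
    inclusion-probability computation.
    """
    rng = random.Random(seed)
    counts = Counter()
    num_batches = 0
    local_train = [v for v in train_nodes if part_of[v] == local_part]
    for _ in range(epochs):
        rng.shuffle(local_train)
        for i in range(0, len(local_train), batch_size):
            batch = local_train[i:i + batch_size]
            for v in sample_neighborhood(adj, batch, fanouts, rng):
                if part_of[v] != local_part:
                    counts[v] += 1
            num_batches += 1
    return {v: c / num_batches for v, c in counts.items()}


def choose_cache(freq, capacity):
    """Pick the `capacity` remote vertices with the highest estimated frequency."""
    return sorted(freq, key=lambda v: (-freq[v], v))[:capacity]


# Toy graph: partition 0 owns vertices 0-2, partition 1 owns vertices 3-4.
adj = {0: [3], 1: [3], 2: [4], 3: [0], 4: [2]}
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
freq = estimate_remote_access(adj, part_of, local_part=0, train_nodes=[0, 1, 2],
                              fanouts=[5], batch_size=1, epochs=1)
cache = choose_cache(freq, capacity=1)
```

In this toy setup, remote vertex 3 is reached from two of the three local minibatches and vertex 4 from one, so a cache of size one holds vertex 3. The same ranking could in principle also decide which features go to GPU memory and which stay in host memory, in the spirit of the prioritization described above.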