Distributed Matrix-Based Sampling for Graph Neural Network Training

The primary contribution of this paper is a set of new methods for reducing communication in the sampling step of distributed GNN training. We propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and uses communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on graphs much larger than those that fit in a single device's memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we show that judiciously replicating feature data with a simple all-to-all exchange can outperform current methods for the feature extraction step in distributed GNN training. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster than Quiver (a distributed extension to PyTorch-Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show an $8.46\times$ speedup in per-epoch time on $128$ GPUs. Finally, we show scaling when the graph is distributed across GPUs, and scaling for both node-wise and layer-wise sampling algorithms.
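To make the idea of "sampling as SpGEMM" concrete, the following is a minimal single-process sketch, not the paper's implementation. It assumes an unweighted adjacency matrix stored in SciPy CSR format and uniform node-wise sampling of one layer; the function name `bulk_sample_neighbors` and its parameters are hypothetical and chosen only for illustration. A selection matrix `Q` stacks the seed vertices of several minibatches, one product `Q @ adj` extracts all of their neighborhoods at once, and each row of the product is then downsampled to at most `fanout` neighbors.

```python
import numpy as np
import scipy.sparse as sp


def bulk_sample_neighbors(adj, batches, fanout, rng):
    """Sample up to `fanout` neighbors per seed vertex for several minibatches at once.

    Illustrative sketch only: assumes `adj` is an unweighted n x n CSR adjacency
    matrix and that neighbors are drawn uniformly without replacement.
    """
    n = adj.shape[0]
    # Stack the seed vertices of all minibatches into one long vector.
    seeds = np.concatenate(batches)
    k = seeds.size

    # Selection matrix Q: row i has a single 1 in the column of seed vertex i.
    Q = sp.csr_matrix((np.ones(k), (np.arange(k), seeds)), shape=(k, n))

    # One sparse-sparse product yields the neighborhood of every seed vertex.
    P = (Q @ adj).tocsr()

    rows, cols = [], []
    for i in range(k):
        nbrs = P.indices[P.indptr[i]:P.indptr[i + 1]]
        if nbrs.size > fanout:
            # Keep a uniform random subset when the neighborhood is too large.
            nbrs = rng.choice(nbrs, size=fanout, replace=False)
        rows.extend([i] * len(nbrs))
        cols.extend(nbrs.tolist())

    # Sampled adjacency: row i holds the sampled neighbors of seed vertex i.
    sampled = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(k, n))
    return seeds, sampled
```

Because the minibatches are stacked into a single `Q`, the cost of setting up and launching the product is paid once for the whole group rather than once per minibatch, which is the amortization the abstract refers to. In a distributed setting the product `Q @ adj` would be replaced by a distributed SpGEMM; that part is not shown here.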
