Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses

Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, training them on large-scale graphs remains a significant challenge due to the lack of efficient data access and data movement methods. Existing frameworks for training GNNs use CPUs for graph sampling and feature aggregation, while the training and updating of model weights are executed on GPUs. However, our in-depth profiling shows that CPUs cannot sustain the sampling and aggregation throughput needed to saturate GNN model training, causing gross under-utilization of expensive GPU resources. Furthermore, when the graph and its embeddings do not fit in CPU memory, operating system overhead, such as page-fault handling, falls on the critical path of execution. To address these issues, we propose the GPU Initiated Direct Storage Access (GIDS) dataloader, which enables GPU-oriented GNN training for large-scale graphs while efficiently utilizing all hardware resources, such as CPU memory, storage, and GPU memory, through a hybrid data placement strategy. By enabling GPU threads to fetch feature vectors directly from storage, the GIDS dataloader solves the memory capacity problem for GPU-oriented GNN training. Moreover, the GIDS dataloader leverages GPU parallelism to tolerate storage latency and eliminates expensive page-fault overhead. Doing so enables us to design novel optimizations for exploiting locality and increasing effective bandwidth for GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets shows that the GIDS dataloader accelerates the overall DGL GNN training pipeline by up to 392X compared to the current state-of-the-art DGL dataloader.
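To make the idea of hybrid data placement concrete, the following is a minimal CPU-only sketch in Python, not the GIDS implementation. It assumes a hypothetical three-tier lookup order (GPU memory, then CPU memory, then storage) and an arbitrary choice of which nodes live in each tier; the class `TieredFeatureStore`, the helper `build_storage`, and the constant `FEATURE_DIM` are illustrative names that do not come from the paper or from DGL. Storage is simulated with an in-memory byte buffer, and the per-node loop stands in for what would be many GPU threads issuing reads in parallel.

```python
import io
import struct

FEATURE_DIM = 4  # hypothetical feature vector length
RECORD = struct.Struct(f"<{FEATURE_DIM}f")  # one float32 vector per node


def build_storage(num_nodes):
    """Simulate a storage device: node i's vector sits at offset i * RECORD.size."""
    buf = io.BytesIO()
    for node in range(num_nodes):
        buf.write(RECORD.pack(*[float(node)] * FEATURE_DIM))
    return buf


class TieredFeatureStore:
    """Toy feature store with an assumed GPU -> CPU -> storage lookup order."""

    def __init__(self, storage, gpu_resident, cpu_resident):
        self.storage = storage
        # Assumption: some nodes are pinned in the faster tiers ahead of time.
        self.gpu_tier = {n: self._read(n) for n in gpu_resident}
        self.cpu_tier = {n: self._read(n) for n in cpu_resident}
        self.hits = {"gpu": 0, "cpu": 0, "storage": 0}

    def _read(self, node):
        """Read one feature vector directly from the simulated storage."""
        self.storage.seek(node * RECORD.size)
        return RECORD.unpack(self.storage.read(RECORD.size))

    def gather(self, node_ids):
        """Collect feature vectors for a minibatch, counting hits per tier."""
        out = []
        for n in node_ids:
            if n in self.gpu_tier:
                self.hits["gpu"] += 1
                out.append(self.gpu_tier[n])
            elif n in self.cpu_tier:
                self.hits["cpu"] += 1
                out.append(self.cpu_tier[n])
            else:
                self.hits["storage"] += 1
                out.append(self._read(n))
        return out


store = TieredFeatureStore(
    build_storage(100), gpu_resident=range(0, 10), cpu_resident=range(10, 30)
)
features = store.gather([3, 15, 42, 7, 99])
print(store.hits)
```

In this toy setup, the hit counters show how many requests each tier absorbs for a given minibatch. The more accesses the faster tiers serve, the fewer reads reach storage, which is the intuition behind exploiting locality to raise effective bandwidth.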
