Graph Neural Networks (GNNs) have demonstrated outstanding performance in various applications. Existing frameworks train GNN models in CPU-GPU heterogeneous environments and use mini-batch and sampling techniques to overcome the GPU memory limitation. In such environments, sample-based GNN training can be divided into three steps: sample, gather, and train. Existing GNN systems use different task orchestrating methods to place each step on the CPU or the GPU. After extensive experiments and analysis, we find that existing task orchestrating methods fail to fully utilize the heterogeneous resources because they are limited by either inefficient CPU processing or GPU resource contention. In this paper, we propose NeutronOrch, a system for sample-based GNN training that incorporates a layer-based task orchestrating method and ensures balanced utilization of the CPU and GPU. NeutronOrch decouples the training process by layer and pushes the training task of the bottom layer down to the CPU, which significantly reduces the computational load and memory footprint of GPU training. To avoid inefficient CPU processing, NeutronOrch offloads only the training of frequently accessed vertices to the CPU and lets the GPU reuse their embeddings with bounded staleness. Furthermore, NeutronOrch provides a fine-grained pipeline design for the layer-based task orchestrating method that fully overlaps different tasks on heterogeneous resources while strictly guaranteeing bounded staleness. Experimental results show that, compared with state-of-the-art GNN systems, NeutronOrch achieves up to 11.51x speedup.
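The idea of offloading only frequently accessed vertices and reusing their embeddings with bounded staleness can be illustrated with a small sketch. This is not NeutronOrch's implementation: the function `select_hot_vertices`, the class `StaleEmbeddingCache`, the parameter `num_hot`, and the use of a per-embedding model version counter are all hypothetical names and assumptions introduced here for illustration. The sketch assumes that "frequently accessed" means "appears most often in the sampled mini-batches" and that staleness is measured as the number of model updates since the embedding was computed.

```python
from collections import Counter


def select_hot_vertices(sampled_batches, num_hot):
    """Pick the num_hot vertices that appear most often in the sampled batches.

    Hypothetical hotness criterion: raw access count, ties broken by vertex id.
    """
    counts = Counter(v for batch in sampled_batches for v in batch)
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return set(ranked[:num_hot])


class StaleEmbeddingCache:
    """Holds bottom-layer embeddings tagged with the model version that produced them.

    In this sketch a CPU worker would call put(), and the GPU trainer would
    call get() and fall back to recomputation when get() returns None.
    """

    def __init__(self, staleness_bound):
        self.staleness_bound = staleness_bound
        self._store = {}  # vertex id -> (embedding, version)

    def put(self, vertex, embedding, version):
        self._store[vertex] = (embedding, version)

    def get(self, vertex, current_version):
        """Return the cached embedding if it is fresh enough, otherwise None."""
        entry = self._store.get(vertex)
        if entry is None:
            return None
        embedding, version = entry
        # Reject embeddings computed more than staleness_bound updates ago.
        if current_version - version > self.staleness_bound:
            return None
        return embedding
```

In a training loop built on this sketch, the trainer would increment the version after every parameter update, so an embedding is reused for at most `staleness_bound` updates before it must be refreshed. How the real system schedules the refresh and overlaps it with GPU work is a pipeline question that this sketch does not model.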