Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Because sampling-based GNN training is resource-intensive, Neural Processing Units (NPUs), such as the Ascend AI processor, are an attractive alternative: their high throughput and energy efficiency make them well-suited for GNN workloads. However, sampling-based training is a multi-stage process. Its stages (subgraph sampling, feature gathering, and model training) differ in resource requirements and computation volume, so they must be carefully coordinated to fully utilize the heterogeneous computation resources of CPUs and NPUs. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline overlaps execution not only between the CPU and the NPU, but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over MindSporeGL, the state-of-the-art NPU-native graph learning system.
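The overlap of sampling, gathering, and training described in the abstract can be illustrated with a minimal sketch. It is not AcOrch's implementation and uses no Ascend or MindSporeGL APIs: plain Python threads stand in for the compute units, and the toy graph `ADJ`, the features `FEATS`, and the functions `sample`, `gather`, `train`, `stage`, and `run_pipeline` are all hypothetical names chosen for illustration. Only a single pipeline level is modeled; the second level (overlap among AIC and AIV units inside the NPU) is not shown.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the mini-batch stream

# Toy graph and scalar node features (illustrative data only).
ADJ = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0]}
FEATS = {node: float(node) for node in ADJ}


def sample(seeds, fanout=1):
    """Stage 1: the seeds plus the first `fanout` neighbors of each seed."""
    nodes = set(seeds)
    for s in seeds:
        nodes.update(ADJ[s][:fanout])
    return sorted(nodes)


def gather(nodes):
    """Stage 2: look up the feature of every sampled node."""
    return [FEATS[n] for n in nodes]


def train(feats):
    """Stage 3: stand-in for a training step; returns the mean feature."""
    return sum(feats) / len(feats)


def stage(work, inbox, outbox):
    """Apply `work` to each item from `inbox` and forward it to `outbox`."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(batches, depth=2):
    """Run sample -> gather -> train with one thread per stage.

    Bounded queues of size `depth` let a stage work on batch i+1 while
    the next stage is still busy with batch i.
    """
    q_seeds = queue.Queue(maxsize=depth)
    q_sampled = queue.Queue(maxsize=depth)
    q_gathered = queue.Queue(maxsize=depth)
    q_done = queue.Queue()  # unbounded so the last stage never blocks

    workers = [
        threading.Thread(target=stage, args=(sample, q_seeds, q_sampled)),
        threading.Thread(target=stage, args=(gather, q_sampled, q_gathered)),
        threading.Thread(target=stage, args=(train, q_gathered, q_done)),
    ]
    for w in workers:
        w.start()
    for batch in batches:
        q_seeds.put(batch)
    q_seeds.put(SENTINEL)
    for w in workers:
        w.join()

    results = []
    while True:
        item = q_done.get()
        if item is SENTINEL:
            break
        results.append(item)
    return results


results = run_pipeline([[0], [1, 2], [3]])
print(results)  # [0.5, 2.0, 1.5]
```

The sketch shows the structure of the overlap rather than any speedup: the stages here are trivial and run as threads in one Python process. In a real system each stage would run on the hardware best suited to it, which is the mapping problem the abstract attributes to AcOrch.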