ARGO: An Auto-Tuning Runtime System for Scalable GNN Training on Multi-Core Processor

As Graph Neural Networks (GNNs) have grown in popularity, libraries such as PyTorch-Geometric (PyG) and Deep Graph Library (DGL) have emerged as the de facto standard for implementing GNNs: they provide graph-oriented APIs and are purpose-built to handle the inherent sparsity and irregularity of graph structures. However, these libraries scale poorly on multi-core processors, which leaves the available platform resources under-utilized and limits performance. The reason is that GNN training is a resource-intensive workload with a high volume of irregular data accesses, and existing libraries fail to use the memory bandwidth efficiently. To address this challenge, we propose ARGO, a novel runtime system for GNN training that offers scalable performance. ARGO exploits multi-processing and core-binding techniques to improve platform resource utilization. We further develop an auto-tuner that searches for the optimal multi-processing and core-binding configuration. The auto-tuner runs automatically and is completely transparent to the user; it also allows ARGO to adapt to different platforms, GNN models, and datasets, among other factors. We evaluate ARGO on two representative GNN models and four widely used datasets on two platforms. With the proposed auto-tuner, ARGO selects a near-optimal configuration while exploring only 5% of the design space. ARGO speeds up state-of-the-art GNN libraries by up to 5.06x and 4.54x on a four-socket Ice Lake machine with 112 cores and a two-socket Sapphire Rapids machine with 64 cores, respectively. Finally, ARGO integrates seamlessly into widely used GNN libraries (e.g., DGL, PyG) with a few lines of code and speeds up GNN training.
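
To make the two ingredients named above concrete, the following is a minimal Python sketch, not ARGO's actual implementation or API. It assumes a configuration is a pair (number of processes, cores per process), that cores are split into contiguous disjoint groups, and that a budgeted random search stands in for the auto-tuner, whose search strategy the abstract does not specify. The names `partition_cores`, `bind_current_process`, and `tune` are hypothetical; `os.sched_setaffinity` is a real but Linux-only call.

```python
import os
import random


def partition_cores(core_ids, num_procs):
    """Split core_ids into num_procs contiguous, disjoint groups (one per process)."""
    if num_procs < 1 or num_procs > len(core_ids):
        raise ValueError("num_procs must be between 1 and the number of cores")
    base, extra = divmod(len(core_ids), num_procs)
    groups, start = [], 0
    for rank in range(num_procs):
        # The first `extra` processes receive one additional core.
        size = base + (1 if rank < extra else 0)
        groups.append(list(core_ids[start:start + size]))
        start += size
    return groups


def bind_current_process(cores):
    """Pin the calling process to `cores`. Returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # only available on Linux
        return False
    os.sched_setaffinity(0, set(cores))  # pid 0 means the calling process
    return True


def tune(measure, space, fraction=0.05, seed=0):
    """Evaluate a random `fraction` of `space` and return (best_config, budget).

    Assumption: random search is a stand-in for the real auto-tuner.
    `measure` is a hypothetical callback returning, e.g., epoch time in seconds.
    """
    budget = max(1, round(len(space) * fraction))
    candidates = random.Random(seed).sample(space, budget)
    return min(candidates, key=measure), budget
```

In use, each training process would call `bind_current_process(groups[rank])` at startup, with `groups = partition_cores(sorted(os.sched_getaffinity(0)), num_procs)` on Linux, and `measure` would run a short timed training trial for the given configuration.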
