As Graph Neural Networks (GNNs) have grown in popularity, libraries such as PyTorch-Geometric (PyG) and Deep Graph Library (DGL) have been proposed. These libraries have emerged as the de facto standard for implementing GNNs because they provide graph-oriented APIs and are purposefully designed to manage the inherent sparsity and irregularity of graph structures. However, they scale poorly on multi-core processors, which under-utilizes the available platform resources and limits performance. This is because GNN training is a resource-intensive workload with a high volume of irregular data accesses, and existing libraries fail to utilize the memory bandwidth efficiently. To address this challenge, we propose ARGO, a novel runtime system for GNN training that offers scalable performance. ARGO exploits multi-processing and core-binding techniques to improve platform resource utilization. We further develop an auto-tuner that searches for the optimal multi-processing and core-binding configuration. The auto-tuner runs automatically, making it completely transparent to the user, and it allows ARGO to adapt to various platforms, GNN models, and datasets. We evaluate ARGO on two representative GNN models and four widely used datasets on two platforms. With the proposed auto-tuner, ARGO selects a near-optimal configuration while exploring only 5% of the design space. ARGO speeds up state-of-the-art GNN libraries by up to 5.06x on a four-socket Ice Lake machine with 112 cores and by up to 4.54x on a two-socket Sapphire Rapids machine with 64 cores. Finally, ARGO integrates seamlessly into widely used GNN libraries (e.g., DGL, PyG) with only a few lines of code.
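To make the multi-processing and core-binding idea concrete, the following is a minimal sketch, not ARGO's actual implementation. It assumes a Linux host, and the helper names `partition_cores` and `bind_current_process` are hypothetical. The abstract does not specify how cores are assigned to processes, so the contiguous, near-equal split used here is only one plausible choice.

```python
import os


def partition_cores(core_ids, num_procs):
    """Split core_ids into num_procs disjoint, contiguous, near-equal groups.

    Hypothetical helper: one plausible way to assign cores to training
    processes; the source does not describe the actual assignment policy.
    """
    num_cores = len(core_ids)
    if num_procs < 1 or num_procs > num_cores:
        raise ValueError("num_procs must be between 1 and the number of cores")
    base, extra = divmod(num_cores, num_procs)
    groups, start = [], 0
    for rank in range(num_procs):
        # The first `extra` processes receive one additional core.
        size = base + (1 if rank < extra else 0)
        groups.append(list(core_ids[start:start + size]))
        start += size
    return groups


def bind_current_process(cores):
    """Pin the calling process to `cores`; return True if binding was applied.

    os.sched_setaffinity is only available on some platforms (e.g. Linux),
    so the call is guarded and the function is a no-op elsewhere.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cores))
        return True
    return False


# Example: 8 cores shared by 3 processes.
# partition_cores(range(8), 3) -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Under this sketch, each training process would call `bind_current_process` with its own group at startup, so that processes do not compete for the same cores. An auto-tuner in this spirit would then vary the number of processes and the group sizes and measure the resulting epoch time.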