GPUs have become the mainstream platform for accelerating deep neural network training. On GNNs, however, GPUs face substantial challenges such as workload imbalance and irregular memory access, leaving hardware underutilized. Existing frameworks such as PyG, DGL (with cuSPARSE), and GNNAdvisor partially address these challenges, but memory traffic remains significant. We argue that drastic performance improvements can only be achieved by vertically integrating algorithm and system innovations, rather than treating speedup optimization as an "afterthought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, a high-performance GPU training system that integrates algorithm and system innovation. (i) We introduce the MaxK nonlinearity, provide a theoretical analysis of it as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format for storing the data and indices of the feature matrix after the nonlinearity; (ii) we design a coalescing-enhanced forward computation with a row-wise product-based SpGEMM kernel that uses CBSR for input feature matrix fetching and strategically places a sparse output accumulation buffer in shared memory; (iii) we develop an optimized backward computation with an outer product-based SSpMM kernel. We conduct extensive evaluations of MaxK-GNN and report end-to-end system runtime. Experiments show that the MaxK-GNN system approaches the theoretical speedup limit given by Amdahl's law. We achieve accuracy comparable to SOTA GNNs at significantly higher speed: 3.22x/4.24x speedup (vs. theoretical limits of 5.52x/7.27x) on Reddit compared to DGL and GNNAdvisor implementations.
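To make the two algorithmic ideas in the abstract concrete, the following is a minimal numpy sketch, not the paper's implementation: it assumes MaxK keeps the k largest entries of each feature row and zeroes the rest, and that CBSR stores exactly k (value, column-index) pairs per row so every row occupies an identically sized, balanced block. Function names `maxk` and `to_cbsr` are illustrative, not from the paper's code.

```python
import numpy as np

def maxk(x, k):
    """MaxK nonlinearity sketch (assumption): keep the k largest entries
    of each row of the feature matrix, zero out the rest."""
    idx = np.argsort(-x, axis=1)[:, :k]               # positions of the k largest values per row
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=1), axis=1)
    return out

def to_cbsr(x_sparse, k):
    """CBSR-style storage sketch (assumption): every row has exactly k
    nonzeros, so the data and index arrays are dense (n_rows, k) blocks
    and no per-row pointer array is needed."""
    idx = np.argsort(-np.abs(x_sparse), axis=1)[:, :k]
    idx.sort(axis=1)                                   # keep column indices ordered within a row
    data = np.take_along_axis(x_sparse, idx, axis=1)
    return data, idx.astype(np.uint8)                  # narrow index type; assumes < 256 feature columns

# Tiny example: 2 nodes, 4 hidden features, k = 2
x = np.array([[0.9, -0.1, 0.3, 0.7],
              [0.2,  0.8, -0.5, 0.1]])
y = maxk(x, k=2)               # [[0.9, 0, 0, 0.7], [0.2, 0.8, 0, 0]]
data, index = to_cbsr(y, k=2)  # data [[0.9, 0.7], [0.2, 0.8]], index [[0, 3], [0, 1]]
```

Because every row carries the same number of nonzeros, thread blocks processing different rows do equal work, which is what makes the balanced format amenable to the coalesced SpGEMM/SSpMM kernels the abstract describes.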