Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in graph computing and analytics. However, the irregularity of real-world graphs poses significant challenges to achieving efficient SpMM for graph data on GPUs. Recently, significant advancements in GPU computing power and the introduction of new, efficient computing cores within GPUs have opened new opportunities for acceleration. In this paper, we present HC-SpMM, a pioneering algorithm that leverages hybrid GPU cores (Tensor cores and CUDA cores) to accelerate SpMM for graphs. To adapt to the computing characteristics of the different GPU cores, we investigate the impact of sparse graph features on the performance of each core type, develop a data partitioning technique for the graph adjacency matrix, and devise a novel strategy for intelligently selecting the most efficient core type for processing each submatrix. Additionally, we optimize HC-SpMM with respect to memory access and thread utilization, so as to exploit the computational resources to their fullest potential. To support complex graph computing workloads, we integrate HC-SpMM into the GNN training pipeline. Furthermore, we propose a kernel fusion strategy to enhance data reuse, as well as a cost-effective graph layout reorganization method that mitigates the irregularity and sparsity of real-world graphs, better fitting the computational models of hybrid GPU cores. Extensive experiments on 14 real-world graph datasets demonstrate that HC-SpMM achieves average speedups of 1.33x and 1.23x over state-of-the-art SpMM kernels and GNN frameworks, respectively.
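To make the core-selection idea concrete, the following is a minimal, CPU-only sketch of the hybrid strategy the abstract describes: the adjacency matrix is split into row blocks, and each block is routed to a dense GEMM-style path (a stand-in for Tensor cores) or a nonzero-only path (a stand-in for CUDA cores) based on its density. The block size, the density threshold, and all function names here are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch of density-based routing between a "dense" kernel
# (Tensor-core-style GEMM on a densified block) and a "sparse" kernel
# (CUDA-core-style iteration over nonzeros). All constants are hypothetical.

BLOCK = 2                # rows per block (real kernels would use e.g. 16)
DENSITY_THRESHOLD = 0.3  # route blocks denser than this to the dense path

def spmm_hybrid(adj_rows, X):
    """Multiply a sparse adjacency matrix (list of {col: val} dicts, one per
    row) by a dense feature matrix X (list of rows), block by block."""
    n_cols = len(X[0])
    out = []
    for start in range(0, len(adj_rows), BLOCK):
        block = adj_rows[start:start + BLOCK]
        nnz = sum(len(r) for r in block)
        density = nnz / (len(block) * len(X))
        if density > DENSITY_THRESHOLD:
            out.extend(dense_block_mm(block, X, n_cols))   # Tensor-core-style
        else:
            out.extend(sparse_block_mm(block, X, n_cols))  # CUDA-core-style
    return out

def dense_block_mm(block, X, n_cols):
    # Densify the block, then perform a plain GEMM: the regular access
    # pattern is what Tensor cores are built for.
    dense = [[row.get(j, 0.0) for j in range(len(X))] for row in block]
    return [[sum(dense[i][k] * X[k][j] for k in range(len(X)))
             for j in range(n_cols)] for i in range(len(dense))]

def sparse_block_mm(block, X, n_cols):
    # Touch only the stored nonzeros: irregular but work-efficient,
    # matching the strengths of scalar CUDA cores.
    return [[sum(v * X[k][j] for k, v in row.items())
             for j in range(n_cols)] for row in block]
```

In the full system, this per-block decision is informed by measured performance characteristics of each core type rather than a fixed threshold, and the graph layout reorganization increases the number of blocks dense enough to benefit from the Tensor-core path.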