Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM allocates registers to cache frequently accessed data, minimizing memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it against two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8x and 1.4x over these two baselines, respectively.
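To make the computation concrete, the following is a minimal reference sketch of SpMM over a sparse matrix stored in CSR format. It is not JITSPMM's generated code, and the abstract does not state which storage format is used; CSR, the function name `spmm_csr`, and its parameter names are assumptions chosen for illustration. The sketch shows only that the loop bounds (the number of nonzeros per row and the number of dense columns `d`) arrive as inputs, which is presumably the kind of runtime information an AOT compiler cannot specialize for.

```python
def spmm_csr(row_ptr, col_idx, vals, B, d):
    """Reference SpMM: C = A @ B, with A sparse (CSR) and B dense with d columns.

    All names here are hypothetical and for illustration only.
    """
    n_rows = len(row_ptr) - 1
    C = [[0.0] * d for _ in range(n_rows)]
    for i in range(n_rows):
        # The number of nonzeros in row i depends on the input matrix.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[p]
            b_row = B[col_idx[p]]
            # The trip count d is a runtime value, so a statically compiled
            # kernel must keep this loop general instead of fully unrolling it.
            for j in range(d):
                C[i][j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = spmm_csr(row_ptr, col_idx, vals, B, 2)
```

Once `d` and the sparsity pattern are known, a code generator can in principle emit a kernel specialized to them, which is the opportunity the abstract describes.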