Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features affect SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline that operates on the Block Compressed Sparse Row (BCSR) format and overlaps TMA data transfer with WGMMA computation. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup over cuDNN/cuBLAS on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.
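To make the BCSR layout mentioned above concrete, the following is a minimal CPU reference sketch in plain Python. It is not the GPU kernel described in this work: it involves no TMA, WGMMA, or warp specialization, and the function names `dense_to_bcsr` and `bcsr_spmm`, the square block size `bs`, and the three-array layout (`row_ptr`, `col_idx`, `blocks`) are illustrative assumptions rather than details taken from the paper. It only shows how storing nonzero blocks lets SpMM skip all-zero blocks of the sparse operand.

```python
def dense_to_bcsr(a, bs):
    """Convert a dense matrix (list of rows) to a BCSR-style layout.

    Assumption: both dimensions of `a` are multiples of the block size `bs`.
    Returns (row_ptr, col_idx, blocks), where only nonzero blocks are stored.
    """
    n_rows, n_cols = len(a), len(a[0])
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(n_rows // bs):
        for bj in range(n_cols // bs):
            # Slice out the bs x bs block at block coordinates (bi, bj).
            block = [a[bi * bs + r][bj * bs:(bj + 1) * bs] for r in range(bs)]
            if any(v != 0 for row in block for v in row):
                col_idx.append(bj)
                blocks.append(block)
        # row_ptr[bi + 1] marks the end of block row bi in col_idx/blocks.
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bcsr_spmm(row_ptr, col_idx, blocks, b, bs):
    """Compute C = A @ B, with A given in BCSR form and B a dense matrix."""
    n_out = len(b[0])
    c = [[0.0] * n_out for _ in range((len(row_ptr) - 1) * bs)]
    for bi in range(len(row_ptr) - 1):
        # Visit only the stored (nonzero) blocks of block row bi.
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]
            for r in range(bs):
                c_row = c[bi * bs + r]
                for t in range(bs):
                    a_val = blocks[k][r][t]
                    b_row = b[bj * bs + t]
                    for j in range(n_out):
                        c_row[j] += a_val * b_row[j]
    return c
```

In a GPU kernel each stored block would typically map onto a Tensor Core tile; here the inner loops simply stand in for that dense block-times-tile product.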