Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability and limitations of roofline modeling for SpMM by explicitly accounting for the impact of matrix sparsity structure on arithmetic intensity and attainable performance. We evaluate three SpMM implementations: Compressed Sparse Row (CSR), Compressed Sparse Blocks (CSB), and Intel's Math Kernel Library (MKL). Each implementation was tested on large-scale matrices from the SuiteSparse collection, grouped by sparsity pattern: block-structured, banded (diagonal), scale-free, and uniformly random. We derive sparsity-aware roofline models that incorporate memory traffic, cache locality, and blocking behavior, and demonstrate that no single model accurately predicts performance across diverse structures. Experiments were conducted on an AMD-based Perlmutter compute node while varying the number of columns of the dense matrix. In particular, we find that blocking and structured sparsity significantly alter effective arithmetic intensity. The results show that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling, and that data layout and blocking strategies must be evaluated in the context of matrix structure rather than through a single unified model.
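To make the role of arithmetic intensity concrete, the sketch below estimates a simple roofline bound for CSR SpMM. This is an illustrative model under stated assumptions, not the paper's derived sparsity-aware models: it assumes double-precision values, 32-bit indices, a streaming store of the output, and two extreme cases of dense-matrix reuse (perfect caching of B versus no reuse at all). The function name `spmm_roofline` and all parameters are hypothetical.

```python
def spmm_roofline(m, k, n, nnz, peak_gflops, bw_gbs, b_reuse=True):
    """Estimate an attainable-performance bound for CSR SpMM C = A @ B,
    where A is m x k with nnz nonzeros and B is dense k x n.

    Illustrative assumptions (not the paper's exact model):
    - 8-byte double values, 4-byte column indices and row pointers
    - b_reuse=True : each row of B is read once (perfect cache reuse)
      b_reuse=False: every nonzero re-reads n elements of B (no reuse)
    Returns (arithmetic intensity in FLOP/byte, attainable GFLOP/s).
    """
    flops = 2.0 * nnz * n                       # one multiply + one add per nonzero per column of B
    bytes_a = 8 * nnz + 4 * nnz + 4 * (m + 1)   # values, column indices, row pointers
    bytes_b = 8 * (k * n if b_reuse else nnz * n)
    bytes_c = 8 * m * n                         # streaming store of the result
    ai = flops / (bytes_a + bytes_b + bytes_c)
    return ai, min(peak_gflops, ai * bw_gbs)    # classic roofline: min(compute peak, AI x bandwidth)
```

Comparing the two reuse extremes for the same matrix shows how structure-dependent cache behavior shifts the effective arithmetic intensity, which is why a single roofline curve cannot cover all sparsity patterns.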