Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly at the low-to-moderate, unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has so far yielded only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to deliver a significant 1.5x memory reduction and a 1.2-1.5x speedup over the dense representation. It also outperforms other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now practical for real-world LLM workloads.