Deep learning has demonstrated effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. Weight pruning, particularly through N:M sparse matrix multiplication, addresses this issue efficiently by transforming dense operations into semi-sparse ones. N:M sparsity offers a tunable trade-off between performance and model accuracy, but it introduces more complex programming and optimization challenges. To address these challenges, we design a systematic top-down performance analysis model for N:M sparsity and propose NM-SpMM, an efficient and general N:M sparsity implementation. Guided by our performance analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, and introduces memory access optimization and pipeline design as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1x faster than nmSPARSE (the state of the art for general N:M sparsity) and 1.4x to 6.3x faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup afforded by the sparsity-induced reduction in computation. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
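For readers unfamiliar with the format: N:M sparsity constrains each group of M consecutive weights to hold at most N nonzeros. The NumPy sketch below is purely illustrative (NM-SpMM itself is a CUDA implementation, and the magnitude-based pruning criterion used here is a common convention, not a detail taken from this work). It prunes a weight matrix to 2:4 sparsity and performs the resulting semi-sparse multiplication, emulated densely:

```python
import numpy as np

def prune_nm(w, n=2, m=4):
    """Zero out all but the n largest-magnitude values in each
    group of m consecutive weights along each row."""
    rows, cols = w.shape
    assert cols % m == 0
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries per group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))   # dense weights
x = rng.standard_normal((16, 4))   # input activations
w_sparse = prune_nm(w, n=2, m=4)

# every group of 4 consecutive weights now holds at most 2 nonzeros
nnz_per_group = (w_sparse.reshape(8, 4, 4) != 0).sum(axis=-1)
assert nnz_per_group.max() <= 2

y = w_sparse @ x  # semi-sparse matmul, here emulated with a dense product
```

An optimized kernel would store only the N nonzeros per group plus their in-group indices, which is what yields the FLOP and memory reduction that the speedups above approach.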