Sparse matrix-vector multiplication (SpMV) is a core kernel in computational science, engineering, and machine learning. Despite substantial efforts to improve SpMV performance on GPUs, problems with data locality, hardware utilization, and load balancing persist, leaving room for further optimization. This paper presents CB-SpMV, a cache-friendly SpMV optimization algorithm built on a novel data-convergent and adaptable 2D blocking structure. CB-SpMV divides the matrix into independent sub-blocks and uses virtual pointers to aggregate the different types of intra-block data, improving cache-level data locality. To improve hardware utilization, a block-aware column aggregation strategy and sub-block format selection are proposed to accelerate computation and adapt to varying sparse matrices. Finally, an inter-block load-balancing algorithm distributes the workload efficiently across thread blocks. Experimental evaluations on 2,843 matrices from the SuiteSparse Collection show that CB-SpMV significantly improves cache hit rates and achieves average speedups of up to 3.95x over state-of-the-art methods such as cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs. The implementation is available at: \url{https://github.com/xing-cong/CB-Sparse}.
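To make the idea of 2D blocking concrete, the following is a minimal CPU-side sketch, not the CB-SpMV implementation. It assumes a fixed square block size, stores each sub-block as a contiguous list of local-coordinate entries (a stand-in for the paper's virtual-pointer aggregation and format selection, which the abstract does not detail), and uses a simple greedy heuristic to spread sub-blocks across workers. All function names and parameters here are hypothetical.

```python
from collections import defaultdict


def build_blocks(rows, cols, vals, block_size):
    """Group COO entries into block_size x block_size sub-blocks.

    Each sub-block keeps its entries together as (local_row, local_col, value),
    so everything needed to process one block sits in one contiguous list.
    """
    blocks = defaultdict(list)
    for r, c, v in zip(rows, cols, vals):
        key = (r // block_size, c // block_size)
        blocks[key].append((r % block_size, c % block_size, v))
    return dict(blocks)


def blocked_spmv(blocks, x, n_rows, block_size):
    """Compute y = A @ x by visiting one sub-block at a time."""
    y = [0.0] * n_rows
    for (block_row, block_col), entries in blocks.items():
        row0 = block_row * block_size
        col0 = block_col * block_size
        # Within a block, only a block_size-wide window of x and y is touched.
        for local_row, local_col, value in entries:
            y[row0 + local_row] += value * x[col0 + local_col]
    return y


def balance_blocks(blocks, n_workers):
    """Greedy assignment (assumed heuristic): largest blocks first,
    each one going to the currently least-loaded worker."""
    loads = [0] * n_workers
    assignment = [[] for _ in range(n_workers)]
    by_size = sorted(blocks, key=lambda k: len(blocks[k]), reverse=True)
    for key in by_size:
        worker = loads.index(min(loads))
        assignment[worker].append(key)
        loads[worker] += len(blocks[key])
    return assignment, loads


if __name__ == "__main__":
    rows = [0, 0, 1, 2, 3, 3]
    cols = [0, 3, 1, 2, 0, 3]
    vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    x = [1.0, 2.0, 3.0, 4.0]

    blocks = build_blocks(rows, cols, vals, block_size=2)
    print(blocked_spmv(blocks, x, n_rows=4, block_size=2))
    print(balance_blocks(blocks, n_workers=2)[1])
```

On a GPU, each worker in this sketch would roughly correspond to a thread block, and the per-block entry list to data laid out so that one thread block reads it with good cache locality; the actual data layout and balancing algorithm in CB-SpMV may differ substantially from this illustration.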