We present a novel, practical approach to speeding up sparse matrix-vector multiplication (SpMVM) on GPUs. The key idea is to apply lossless entropy coding to further compress a sparse matrix that is already stored in one of the commonly supported formats. Our method is based on dtANS, our new lossless compression method, which adapts the entropy coding technique of asymmetric numeral systems (ANS) specifically for fast parallel GPU decoding in tandem with SpMVM. We apply dtANS to the widely used CSR format and present extensive benchmarks on the SuiteSparse collection of matrices against the state-of-the-art cuSPARSE library. On matrices with at least 2^15 entries and at least 10 entries per row on average, our compression reduces the matrix size relative to the smallest of the cuSPARSE formats (CSR, COO and SELL) in almost all cases, by a factor of up to 11.77. Further, we achieve an SpMVM speedup for the majority of matrices with at least 2^25 nonzero entries; the best speedup is 3.48x. We also show that we can improve over the AI-based multi-format AlphaSparse, in an experiment whose scope is limited by AlphaSparse's extreme computational overhead. We provide our code as an open-source C++/CUDA header library, which includes both compression and multiplication kernels.
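To make the setting concrete, the following is a minimal host-side C++ sketch of the uncompressed CSR baseline, together with a rough estimate of how compressible the column indices are. It is not dtANS and not the released library: the names `CsrMatrix`, `spmv_csr`, `column_deltas` and `entropy_bits` are hypothetical, and delta-encoding the column indices within each row is an assumed preprocessing step chosen for illustration, not something stated above. The sketch also assumes that column indices are sorted within each row.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Plain CSR storage (hypothetical struct, for illustration only).
// row_ptr has rows + 1 entries; col_idx and values have one entry per nonzero.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::uint32_t> row_ptr;
    std::vector<std::uint32_t> col_idx;
    std::vector<double> values;
};

// Sequential reference for y = A * x on uncompressed CSR.
std::vector<double> spmv_csr(const CsrMatrix& a, const std::vector<double>& x) {
    assert(x.size() == a.cols);
    assert(a.row_ptr.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        for (std::uint32_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            y[r] += a.values[k] * x[a.col_idx[k]];
        }
    }
    return y;
}

// Assumed preprocessing: per-row differences of consecutive column indices.
// The first delta of each row is the column index itself (previous = 0).
std::vector<std::uint32_t> column_deltas(const CsrMatrix& a) {
    std::vector<std::uint32_t> deltas;
    deltas.reserve(a.col_idx.size());
    for (std::size_t r = 0; r < a.rows; ++r) {
        std::uint32_t prev = 0;
        for (std::uint32_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            deltas.push_back(a.col_idx[k] - prev);
            prev = a.col_idx[k];
        }
    }
    return deltas;
}

// Empirical zeroth-order Shannon entropy in bits per symbol.
// An ideal entropy coder for this model needs about this many bits per
// symbol, compared with 32 bits for a raw std::uint32_t index.
double entropy_bits(const std::vector<std::uint32_t>& symbols) {
    if (symbols.empty()) return 0.0;
    std::map<std::uint32_t, std::size_t> counts;
    for (std::uint32_t s : symbols) ++counts[s];
    const double n = static_cast<double>(symbols.size());
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

Calling `entropy_bits(column_deltas(a))` gives a lower bound, under this simple model, on the average number of bits per column index. When that value is well below 32, entropy coding has room to shrink the index array, which is the kind of redundancy an ANS-based coder can exploit.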