We present a novel, practical approach to speeding up sparse matrix-vector multiplication (SpMVM) on GPUs. The key idea is to apply lossless entropy coding to further compress the sparse matrix when it is stored in one of the commonly supported formats. Our method is based on dtANS, a new lossless compression method that adapts the entropy coding technique of asymmetric numeral systems (ANS) specifically for fast parallel GPU decoding in tandem with SpMVM. We apply dtANS to the widely used CSR format and present extensive benchmarks on the SuiteSparse matrix collection against the state-of-the-art cuSPARSE library. On matrices with at least 2^(15) nonzero entries and at least 10 entries per row on average, our compression reduces the matrix size below that of the smallest cuSPARSE format (CSR, COO, or SELL) in almost all cases, by a factor of up to 11.77. Furthermore, we achieve an SpMVM speedup for the majority of matrices with at least 2^(25) nonzero entries, with a best speedup of 3.48x. We also show that we can improve over the AI-based multi-format AlphaSparse in an experiment whose scope is limited by AlphaSparse's extreme computational overhead. We provide our code as an open-source C++/CUDA header library that includes both compression and multiplication kernels.