The matrices used in many computational settings are naturally sparse, containing only a small fraction of nonzero elements. Storing such matrices in specialized sparse formats enables algorithms to avoid wasting computation on zeros, significantly accelerating common matrix computations such as sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). In many real-world sparse matrices, however, the nonzero elements are densely clustered in subregions of the matrix. For matrices that exhibit this sort of structured sparsity, hybrid formats can further improve performance by representing these subregions as dense blocks. Existing hybrid formats either fix the dimensions of dense blocks, padding irregular regions with zeros and wasting computation, or incur run-time overhead when iterating over variable-sized blocks. This paper presents SABLE, a framework for accelerating structured sparse matrix computations that uses staging to achieve the best of both approaches. Ahead of execution, SABLE inspects the matrix to identify variable-sized dense subregions, which it stores using a new hybrid format. It then eliminates the overhead typically associated with variable-sized blocks by using staging to generate specialized code that is amenable to vectorization. We evaluate SABLE on SpMV and SpMM kernels using matrices from the popular SuiteSparse data set. SABLE outperforms the best available SpMV baseline by ${\sim}$10\% on average, and SpMM baselines by ${\sim}$20\%. When parallelized with 8 threads, SABLE achieves further speedups of up to ${\sim}7\times$ over the best fully-sparse baseline on both SpMV and SpMM.
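The inspect-then-stage idea described above can be illustrated with a minimal sketch. This is not SABLE's actual implementation or format; it is a hypothetical Python toy that (1) scans a matrix for variable-width dense blocks and (2) stages a specialized SpMV kernel with one fully unrolled statement per block, so no run-time loop over block sizes remains. The helper names `find_dense_blocks` and `stage_spmv` are invented for this example.

```python
# Hypothetical sketch of staging for variable-sized dense blocks
# (illustrative only; not the SABLE implementation or its hybrid format).

def find_dense_blocks(matrix):
    """Naive inspector: treat each maximal horizontal run of nonzeros
    as one variable-width dense block (row, start_col, values)."""
    blocks = []
    for i, row in enumerate(matrix):
        j, n = 0, len(row)
        while j < n:
            if row[j] != 0:
                k = j
                while k < n and row[k] != 0:
                    k += 1
                blocks.append((i, j, row[j:k]))
                j = k
            else:
                j += 1
    return blocks

def stage_spmv(blocks, n_rows):
    """Stage a specialized kernel: emit one unrolled statement per block,
    eliminating the run-time iteration over variable block widths."""
    lines = ["def spmv(x):", f"    y = [0.0] * {n_rows}"]
    for i, j0, vals in blocks:
        terms = " + ".join(f"{v!r}*x[{j0 + d}]" for d, v in enumerate(vals))
        lines.append(f"    y[{i}] += {terms}")
    lines.append("    return y")
    ns = {}
    exec("\n".join(lines), ns)  # compile the generated source
    return ns["spmv"]

A = [[4.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 3.0],
     [0.0, 5.0, 6.0, 0.0]]
spmv = stage_spmv(find_dense_blocks(A), len(A))
print(spmv([1.0, 1.0, 1.0, 1.0]))  # matches the dense product: [5.0, 5.0, 11.0]
```

In a real system the staged kernel would be generated as vectorizable C rather than Python, but the structure is the same: the matrix-dependent control flow is resolved ahead of execution, leaving straight-line code over each dense block.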