The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) $C = A \times B$ is a fundamental routine used extensively in domains such as machine learning and graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector architectures remains relatively unexplored. The most recent algorithm for running SpGEMM on these architectures is based on the SParse Accumulator (SPA) approach. Because it computes the columns of $C$ one at a time, it is relatively efficient for sparse matrices with several tens of non-zero coefficients per column. However, for matrices with only a few non-zero coefficients per column, this state-of-the-art algorithm cannot fully exploit long vector architectures. To overcome this issue we propose two algorithms: SPA paRallel with Sorting (SPARS), which, among other optimizations, computes several columns of $C$ in parallel, and HASH, which uses dynamically sized hash tables to store intermediate output values. To combine the efficiency of SPA on relatively dense matrix blocks with the high performance that SPARS and HASH deliver on very sparse matrix blocks, we propose H-SPA(t) and H-HASH(t), which dynamically switch between different algorithms. Over a set of 40 sparse matrices from the SuiteSparse Matrix Collection, H-SPA(t) and H-HASH(t) obtain average speed-ups of 1.24$\times$ and 1.57$\times$, respectively, with respect to SPA. For the 22 sparsest matrices, they deliver average speed-ups of 1.42$\times$ and 1.99$\times$, respectively.
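To make the column-by-column SPA idea concrete, the following is a minimal scalar sketch in Python. It is not the vectorized implementation evaluated in this work, and it assumes that all matrices are stored in compressed sparse column (CSC) form. The function name `spgemm_spa` and its argument names are hypothetical, chosen only for illustration.

```python
def spgemm_spa(n_rows, a_colptr, a_rowidx, a_vals, b_colptr, b_rowidx, b_vals):
    """Illustrative SpGEMM C = A * B using a dense sparse accumulator (SPA).

    Assumption: A, B and C are in CSC form (column pointers, row indices,
    values). n_rows is the number of rows of A, and therefore of C.
    """
    n_cols_b = len(b_colptr) - 1
    spa_vals = [0.0] * n_rows     # dense accumulator holding one column of C
    spa_flag = [False] * n_rows   # True for rows already touched in this column
    c_colptr, c_rowidx, c_vals = [0], [], []

    for j in range(n_cols_b):     # columns of C are computed one by one
        touched = []              # rows holding a non-zero in column j of C
        for p in range(b_colptr[j], b_colptr[j + 1]):
            k, b_kj = b_rowidx[p], b_vals[p]
            # Scale column k of A by B[k, j] and accumulate into the SPA.
            for q in range(a_colptr[k], a_colptr[k + 1]):
                i = a_rowidx[q]
                if not spa_flag[i]:
                    spa_flag[i] = True
                    touched.append(i)
                spa_vals[i] += a_vals[q] * b_kj
        # Gather the accumulated entries in row order, then reset the SPA.
        for i in sorted(touched):
            c_rowidx.append(i)
            c_vals.append(spa_vals[i])
            spa_vals[i] = 0.0
            spa_flag[i] = False
        c_colptr.append(len(c_rowidx))

    return c_colptr, c_rowidx, c_vals
```

In this sketch the inner loop runs once per non-zero of a single column of $A$, so when columns hold only a few non-zeros there is little work to map onto long vectors. This is consistent with the limitation described above that motivates computing several columns of $C$ at once.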