This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, an operation that is frequently used in sketching algorithms. We develop a novel scheme that uses blocking and recomputation (on-the-fly random number generation) to accelerate this operation. The proposed techniques decrease memory movement, thereby increasing the algorithm's parallel scalability on shared-memory architectures. On the Intel Frontera architecture, our algorithm achieves speedups of 2x over libraries such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we obtain a parallel efficiency of up to approximately 45%. We also present a theoretical analysis of the memory movement lower bound of our algorithm, showing that under mild assumptions it is possible to beat the data movement lower bound of general matrix-matrix multiply (GEMM) by a factor of $\sqrt{M}$, where $M$ is the cache size. Finally, we incorporate our sketching algorithm into a randomized least squares solver. For extremely over-determined sparse input matrices, we show that our solver is competitive with SuiteSparse; in some cases, we obtain a speedup of 10x over SuiteSparse.
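To make the idea of blocking plus recomputation concrete, the following is a minimal sketch and not the authors' kernel. It computes $B = SA$ for a sparse $A$ stored in compressed sparse column (CSC) form without ever storing the dense random matrix $S$: each needed piece of a column of $S$ is regenerated from a seed at the moment it is used. The function names (`s_block`, `sketch_csc`, `dense_S`), the Gaussian entries, the block size, and the string-seeded standard-library generator are all assumptions made for illustration; a tuned implementation would presumably use a much faster generator and vectorized inner loops.

```python
import random


def s_block(seed, blk, i, size):
    """Regenerate `size` entries of column i of S that lie in row block `blk`.

    Hypothetical helper: the entries depend only on (seed, blk, i), so they
    can be recomputed on demand instead of being read from memory.
    """
    rng = random.Random(f"{seed}:{blk}:{i}")
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def sketch_csc(d, indptr, indices, data, seed=0, block=4):
    """Return B = S @ A (d x n, list of lists) for A given in CSC form.

    S is a d x m random matrix that is never materialized.
    """
    n = len(indptr) - 1
    B = [[0.0] * n for _ in range(d)]
    # Outer loop: one block of rows of S (and of B) at a time.
    for blk, r0 in enumerate(range(0, d, block)):
        size = min(block, d - r0)
        for j in range(n):  # columns of A
            for p in range(indptr[j], indptr[j + 1]):  # nonzeros of column j
                i, a = indices[p], data[p]
                s = s_block(seed, blk, i, size)  # recompute, do not load
                for t in range(size):
                    B[r0 + t][j] += s[t] * a
    return B


def dense_S(d, m, seed=0, block=4):
    """Materialize S explicitly; used only to check the sketch above."""
    S = [[0.0] * m for _ in range(d)]
    for blk, r0 in enumerate(range(0, d, block)):
        size = min(block, d - r0)
        for i in range(m):
            s = s_block(seed, blk, i, size)
            for t in range(size):
                S[r0 + t][i] = s[t]
    return S


# A small 5 x 3 sparse matrix in CSC form (illustrative data).
indptr = [0, 2, 3, 5]
indices = [0, 3, 2, 1, 4]
data = [1.0, 2.0, -1.0, 0.5, 3.0]

B = sketch_csc(6, indptr, indices, data, seed=7, block=4)
```

The design choice illustrated here is the trade of extra arithmetic for less data movement: the only arrays read from memory are the nonzeros of $A$ and the current block of $B$, while the entries of $S$ are produced in place by the generator.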