This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, an operation frequently used in sketching algorithms. We develop a novel scheme that uses blocking and recomputation (on-the-fly random number generation) to accelerate this operation. The proposed techniques decrease memory movement, thereby improving the algorithm's parallel scalability on shared-memory architectures. On the Intel Frontera architecture, our algorithm can achieve 2x speedups over libraries such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we obtain a parallel efficiency of up to approximately 45%. We also present a theoretical analysis of the memory movement lower bound of our algorithm, showing that under mild assumptions it is possible to beat the data movement lower bound of general matrix-matrix multiplication (GEMM) by a factor of $\sqrt{M}$, where $M$ is the cache size. Finally, we incorporate our sketching algorithm into a randomized least squares solver. For extremely over-determined sparse input matrices, we show that our results are competitive with SuiteSparse; in some cases, we obtain a speedup of 10x over SuiteSparse.
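To make the idea of blocking with recomputation concrete, the following is a minimal sketch in plain Python, not the implementation evaluated in this work. It assumes a Gaussian sketching matrix $S$ of size $d \times m$ and a sparse matrix $A$ stored in compressed sparse column (CSC) form. The function names `sketch_column_block`, `sketch_sparse`, and `dense_reference` are hypothetical, and Python's `random.Random` seeded with a string key stands in for whatever fast generator a tuned implementation would use. The point it illustrates is that $S$ is never stored: each needed piece of a column of $S$ is regenerated from a key whenever a nonzero of $A$ refers to it.

```python
import random


def sketch_column_block(seed, blk, j, rows):
    """Regenerate one block of column j of S from a key (assumed scheme).

    The same (seed, blk, j) always yields the same values, so the entries
    can be recomputed on demand instead of being read from memory.
    """
    rng = random.Random(f"{seed}:{blk}:{j}")
    return [rng.gauss(0.0, 1.0) for _ in range(rows)]


def sketch_sparse(d, n_cols, csc, seed=0, block=4):
    """Compute S @ A without materializing S.

    csc is a tuple (indptr, indices, data) describing A in CSC form.
    Rows of the output are processed in blocks of size `block`.
    """
    indptr, indices, data = csc
    out = [[0.0] * n_cols for _ in range(d)]
    for blk, r0 in enumerate(range(0, d, block)):
        rows = min(block, d - r0)
        for k in range(n_cols):
            for p in range(indptr[k], indptr[k + 1]):
                j, a = indices[p], data[p]
                # Recompute the needed entries of S on the fly.
                s = sketch_column_block(seed, blk, j, rows)
                for i in range(rows):
                    out[r0 + i][k] += a * s[i]
    return out


def dense_reference(d, m, n_cols, csc, seed=0, block=4):
    """Build S explicitly with the same keys and multiply, for checking."""
    S = [[0.0] * m for _ in range(d)]
    for blk, r0 in enumerate(range(0, d, block)):
        rows = min(block, d - r0)
        for j in range(m):
            s = sketch_column_block(seed, blk, j, rows)
            for i in range(rows):
                S[r0 + i][j] = s[i]
    indptr, indices, data = csc
    out = [[0.0] * n_cols for _ in range(d)]
    for k in range(n_cols):
        for p in range(indptr[k], indptr[k + 1]):
            j, a = indices[p], data[p]
            for i in range(d):
                out[i][k] += a * S[i][j]
    return out


if __name__ == "__main__":
    # A small 5 x 3 sparse matrix in CSC form (illustrative values only).
    csc = ([0, 2, 3, 5], [0, 3, 1, 2, 4], [1.0, 2.0, -1.0, 0.5, 3.0])
    sketched = sketch_sparse(6, 3, csc, seed=7, block=4)
    for row in sketched:
        print(" ".join(f"{x: .4f}" for x in row))
```

In this sketch the only data read from memory are the nonzeros of $A$ and the output block; the random entries cost arithmetic rather than memory traffic. That is the trade-off the abstract describes, although the block shapes, generator, and loop order of the actual algorithm may differ.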