We present an implementation of Pagh's compressed matrix multiplication algorithm, a randomized algorithm that constructs sketches of matrices to compute an unbiased estimate of their product. By leveraging fast polynomial multiplication via the FFT, the algorithm achieves high performance when the product matrix is sparse or contains only a small number of entries with magnitudes significantly larger than the rest. We show empirically that the algorithm is practical and can outperform state-of-the-art DGEMM implementations in exactly these regimes. As a minor theoretical contribution, we replace the FFT with the Fast Walsh-Hadamard Transform (FWHT) in sketched multiplication, preserving all correctness and variance guarantees of the original algorithm. We evaluate our carefully engineered multithreaded CPU implementation for dense double-precision matrices on 64-core CPU nodes, using a range of synthetic benchmarks with varying sparsity patterns; the FWHT variant is up to 4 times faster than the FFT-based version. Under favorable sparsity and magnitude patterns in the product matrix, our FWHT-based implementation achieves a speedup of up to 40x over DGEMM from Intel MKL, with a low probability of error in the estimates. Our implementation is released as free software and comes with NumPy-compatible Python bindings.
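To make the sketching idea concrete, the following is a minimal NumPy sketch of the FFT-based estimator, not the released implementation or its Python bindings. The function name `compressed_matmul` and the parameters `b` (sketch size), `d` (number of independent sketches), and `seed` are hypothetical. For simplicity it assumes fully random hash and sign functions drawn from NumPy's generator, and it combines the `d` estimates with an entrywise median.

```python
import numpy as np

def compressed_matmul(A, B, b, d=1, seed=0):
    """Estimate A @ B from d independent sketches of size b (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], B.shape[1]
    estimates = np.empty((d, n, m))
    for t in range(d):
        # Assumption: fully random bucket hashes and sign hashes.
        h1 = rng.integers(0, b, size=n)
        h2 = rng.integers(0, b, size=m)
        s1 = rng.choice([-1.0, 1.0], size=n)
        s2 = rng.choice([-1.0, 1.0], size=m)
        # Sketch each column of A: PA[h1[i], k] += s1[i] * A[i, k].
        PA = np.zeros((b, A.shape[1]))
        np.add.at(PA, h1, s1[:, None] * A)
        # Sketch each row of B: PB[h2[j], k] += s2[j] * B[k, j].
        PB = np.zeros((b, B.shape[0]))
        np.add.at(PB, h2, (B * s2[None, :]).T)
        # Sum over k of the circular convolutions of PA[:, k] and PB[:, k],
        # computed as a pointwise product in the frequency domain.
        FA = np.fft.rfft(PA, axis=0)
        FB = np.fft.rfft(PB, axis=0)
        sketch = np.fft.irfft((FA * FB).sum(axis=1), n=b)
        # Entry (i, j) of the product is read from bucket (h1[i] + h2[j]) mod b.
        idx = (h1[:, None] + h2[None, :]) % b
        estimates[t] = s1[:, None] * s2[None, :] * sketch[idx]
    return np.median(estimates, axis=0)
```

A larger `b` lowers the chance that two large entries of the product collide in the same bucket, which is why the estimate is most accurate when the product is sparse or dominated by a few large entries.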