Sparse Candecomp/PARAFAC (CP) decomposition, a generalization of the matrix singular value decomposition to higher-dimensional tensors, is a popular tool for analyzing diverse datasets. On tensors with billions of nonzero entries, computing a CP decomposition is computationally intensive. We propose the first distributed-memory implementations of two randomized CP decomposition algorithms, CP-ARLS-LEV and STS-CP, that offer nearly an order-of-magnitude speedup at high decomposition ranks over well-tuned non-randomized decomposition packages. Both algorithms rely on leverage score sampling and enjoy strong theoretical guarantees, though they trade off time and accuracy differently. We tailor the communication schedule for our random sampling algorithms, eliminating expensive reduction collectives and forcing communication costs to scale with the random sample count. Finally, we optimize the local storage format for our methods, switching between an analogue of compressed sparse column and compressed sparse row formats to facilitate both random sampling and efficient parallelization of sparse-dense matrix multiplication. Experiments show that our methods are fast and scalable, achieving an 11x speedup over SPLATT when computing a decomposition of the billion-scale Reddit tensor on 512 CPU cores in under 2 minutes.
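To illustrate the leverage score sampling that both algorithms rely on, here is a minimal single-node sketch for a generic dense least-squares problem. It is not the CP-ARLS-LEV or STS-CP procedure and it ignores the distributed and sparse-tensor aspects entirely. The function names `leverage_scores` and `sampled_least_squares`, the matrix sizes, and the sample count are all assumptions chosen for illustration.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: squared row norms of Q from a thin QR of A."""
    Q, _ = np.linalg.qr(A)  # reduced QR, Q has shape (rows, cols)
    return np.sum(Q ** 2, axis=1)

def sampled_least_squares(A, b, num_samples, rng):
    """Solve min ||Ax - b|| approximately using rows drawn by leverage score."""
    scores = leverage_scores(A)
    probs = scores / scores.sum()
    # Draw row indices i.i.d. with probability proportional to leverage score.
    idx = rng.choice(A.shape[0], size=num_samples, replace=True, p=probs)
    # Rescale each sampled row so the sketched problem is unbiased.
    weights = 1.0 / np.sqrt(num_samples * probs[idx])
    SA = A[idx] * weights[:, None]
    Sb = b[idx] * weights
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x

# Illustrative sizes (assumed): a tall 2000 x 5 system solved from 200 sampled rows.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))
x_true = rng.standard_normal(5)
b = A @ x_true  # consistent right-hand side, no noise
x_hat = sampled_least_squares(A, b, 200, rng)
```

The cost of the reduced solve depends on the number of sampled rows rather than on the full row count. This sketch computes exact scores from a dense QR factorization, which would not be practical at the tensor sizes discussed above.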