A Performance Portable Matrix Free Dense MTTKRP in GenTen

We extend the GenTen tensor decomposition package with an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor decompositions, that is portable and performant on modern CPU and GPU architectures. State-of-the-art MTTKRP kernels, such as those used by Tensor Toolbox and TensorLy, are based on matrix multiplication and explicitly form Khatri-Rao matrices. In contrast, we develop a matrix-free, element-wise parallelization approach whose memory cost grows with the rank R in proportion to the sum of the tensor dimensions, O(R(m+n+k)), whereas the memory cost of matrix-based methods grows in proportion to the product of the tensor dimensions, O(R(mnk)). For the largest problem we study, a rank 2000 MTTKRP, the smaller growth rate yields a matrix-free memory cost of just 2% of that of the matrix-based methods, a 50x improvement. In practice, the reduced memory footprint means our matrix-free MTTKRP can compute a rank 2000 tensor decomposition on a single NVIDIA H100, whereas a matrix-based MTTKRP requires six H100s. We also compare our optimized matrix-free MTTKRP to baseline matrix-free implementations on different devices, showing a 3x single-device speedup on an Intel 8480+ CPU and an 11x speedup on an H100 GPU. In addition to numerical results, we provide fine-grained performance models for an ideal multi-level cache machine, compare analytical performance predictions to empirical results, and provide a motivated heuristic for selecting an algorithmic hyperparameter.
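To make the distinction between the two approaches concrete, the following is a minimal sketch of a mode-1 MTTKRP for a 3-way dense tensor, written in plain Python for illustration only. It is not GenTen's implementation and does not reflect its parallelization or optimizations. The function names `mttkrp_matrix_free` and `mttkrp_explicit_krp`, the nested-list data layout, and the choice of mode are all assumptions made for this example. The first function accumulates each tensor element directly into the output; the second forms the Khatri-Rao matrix first, which is the extra storage a matrix-free method avoids.

```python
import random


def mttkrp_matrix_free(X, B, C):
    """Mode-1 MTTKRP, M[i][r] = sum_{j,l} X[i][j][l] * B[j][r] * C[l][r],
    computed element by element without forming a Khatri-Rao matrix."""
    m, n, k = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    M = [[0.0] * R for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for l in range(k):
                x = X[i][j][l]
                for r in range(R):
                    # Only the factor rows B[j] and C[l] are read here.
                    M[i][r] += x * B[j][r] * C[l][r]
    return M


def mttkrp_explicit_krp(X, B, C):
    """Same result via an explicitly formed Khatri-Rao matrix.
    Also returns the number of entries stored in that matrix."""
    m, n, k = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    # Khatri-Rao matrix: row index j*k + l holds B[j][r] * C[l][r].
    K = [[B[j][r] * C[l][r] for r in range(R)]
         for j in range(n) for l in range(k)]
    # Mode-1 unfolding of X with the matching column index j*k + l.
    X1 = [[X[i][j][l] for j in range(n) for l in range(k)]
          for i in range(m)]
    M = [[sum(X1[i][c] * K[c][r] for c in range(n * k)) for r in range(R)]
         for i in range(m)]
    return M, len(K) * R


if __name__ == "__main__":
    random.seed(0)
    m, n, k, R = 3, 4, 5, 2  # small, arbitrary sizes for the demo
    X = [[[random.random() for _ in range(k)] for _ in range(n)]
         for _ in range(m)]
    B = [[random.random() for _ in range(R)] for _ in range(n)]
    C = [[random.random() for _ in range(R)] for _ in range(k)]
    M_free = mttkrp_matrix_free(X, B, C)
    M_krp, extra = mttkrp_explicit_krp(X, B, C)
    max_diff = max(abs(a - b) for ra, rb in zip(M_free, M_krp)
                   for a, b in zip(ra, rb))
    print("max difference:", max_diff)
    print("Khatri-Rao entries stored:", extra)
```

In this sketch the explicit variant stores n·k·R intermediate entries for this single mode, while the matrix-free variant needs no intermediate beyond the factor matrices and the output. How that translates into the memory figures reported above depends on details of the implementations being compared that this example does not model.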
