Sparse tensors are the most widely used representation of sparse multidimensional data. Decomposing them, which extracts their most important features while reducing their dimensionality, has become a common procedure in machine learning. One of the most widely used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), whose most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, which makes it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize the partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, and further boost it through heterogeneous collaboration with the CPU. Experimental results show that the proposed PIM-based and heterogeneous approaches achieve speedups of up to 2.37x and 2.64x, respectively, over state-of-the-art CPU implementations. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the resource efficiency of this approach, measured as the fraction of peak performance achieved, is significantly higher than that of both CPU and GPU.
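To make the spMTTKRP operation concrete, the following is a minimal sketch for a three-mode tensor stored in coordinate (COO) form. It is an illustration only, not the PRISM kernel: the function name `spmttkrp_mode0`, the nonzero layout `(i, j, k, value)`, and the list-of-lists factor matrices `B` and `C` are all assumptions made for this example. For each nonzero, the sketch reads one row of each factor matrix and accumulates their element-wise product, scaled by the nonzero value, into the output row selected by the first index. Each nonzero performs only a few arithmetic operations per rank column while reading rows at data-dependent locations, which is consistent with the memory-bound behaviour described above.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Nonzero = Tuple[int, int, int, float]  # assumed layout: (i, j, k, value)


def spmttkrp_mode0(nonzeros: List[Nonzero], B: Matrix, C: Matrix,
                   num_rows: int, rank: int) -> Matrix:
    """Mode-0 spMTTKRP sketch: out[i][r] += value * B[j][r] * C[k][r]."""
    out = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, value in nonzeros:
        b_row = B[j]      # row of the mode-1 factor, chosen by the nonzero
        c_row = C[k]      # row of the mode-2 factor, chosen by the nonzero
        out_row = out[i]  # output row, chosen by the nonzero's first index
        for r in range(rank):
            out_row[r] += value * b_row[r] * c_row[r]
    return out


# Tiny hypothetical 2x2x2 tensor with three nonzeros and rank-2 factors.
nonzeros = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 0, 1, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = spmttkrp_mode0(nonzeros, B, C, num_rows=2, rank=2)
print(M)  # [[47.0, 76.0], [21.0, 48.0]]
```

In CP-ALS, an operation of this shape is repeated for every mode in every iteration, so its cost dominates the decomposition; how the nonzeros are partitioned across processing units is one of the choices the abstract refers to, and this sketch does not attempt to model it.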