Sparse tensors are prevalent in real-world applications and are often large-scale, high-order, and high-dimensional. Directly manipulating such raw tensors is impractical due to prohibitive memory and computational overhead, so the mainstream approach is to compress or decompose the original tensor; the Tucker decomposition is one popular choice. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex subproblems to guarantee polynomial convergence, and as a result they converge slowly. In contrast, tensor decomposition has a benign optimization landscape, so local search algorithms can converge to a global (approximate) optimum much faster. In this paper, we propose FastTuckerPlus, which splits the original optimization problem into two non-convex subproblems and solves them alternately by Stochastic Gradient Descent. We further introduce cuFastTuckerPlus, a fine-grained parallel algorithm for GPU platforms that exploits tensor cores; it minimizes memory-access overhead and computational cost, outperforming state-of-the-art algorithms. Our experiments show that our method achieves a $3\times$ to $5\times$ speedup over state-of-the-art algorithms.
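For context, a minimal sketch of the optimization being alternated over, with notation we assume here rather than taken from the abstract: for an $N$-order sparse tensor $\mathcal{X}$ with observed entries $\Omega$, core tensor $\mathcal{G}$, and factor matrices $A^{(n)}$, the objective is

$$
\min_{\mathcal{G},\,\{A^{(n)}\}}\ \sum_{(i_1,\dots,i_N)\in\Omega}\Bigl(\mathcal{X}_{i_1\cdots i_N}-\sum_{r_1,\dots,r_N}\mathcal{G}_{r_1\cdots r_N}\prod_{n=1}^{N}A^{(n)}_{i_n r_n}\Bigr)^2,
$$

and the two non-convex subproblems correspond to SGD updates of the factor matrices $\{A^{(n)}\}$ with $\mathcal{G}$ fixed, alternating with SGD updates of $\mathcal{G}$ with $\{A^{(n)}\}$ fixed.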