Sparse tensors are prevalent in real-world applications and are often large-scale, high-order, and high-dimensional. Handling the raw tensors directly is impractical because of the memory and computational overhead involved, so the mainstream approach is to compress or decompose the original tensor. One popular tensor decomposition is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence, and these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, so local search algorithms can converge to a global (approximate) optimum much faster. In this paper, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using Stochastic Gradient Descent (SGD). Furthermore, we introduce cuFastTuckerPlus, a fine-grained parallel algorithm designed for GPU platforms that leverages the performance of tensor cores. This algorithm minimizes memory access overhead and computational cost, surpassing state-of-the-art algorithms. Our experimental results show that our method achieves a speedup of $3\times$ to $5\times$ over state-of-the-art algorithms.
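To make the alternating scheme concrete, the following is a minimal NumPy sketch of SGD on a sparse Tucker-style model. It is an illustration under stated assumptions, not the FastTuckerPlus or cuFastTuckerPlus implementation: it assumes the core tensor is parameterised as a sum of `R` rank-one terms built from small "core matrices" (a modelling choice made here so that both subproblems are non-convex; the abstract does not specify it), it runs on the CPU, and all names (`predict`, `sgd_pass`, `demo`), sizes, and the learning rate are hypothetical.

```python
import numpy as np

def predict(factors, cores, idx):
    """Predict entries at idx (batch x N) and return the per-mode terms c[n]."""
    # c[n][s, r] = <row idx[s, n] of factor n, column r of core matrix n>
    c = [factors[n][idx[:, n]] @ cores[n] for n in range(len(factors))]
    return np.prod(c, axis=0).sum(axis=1), c

def others_product(c, n):
    """Elementwise product of every c[m] with m != n."""
    out = np.ones_like(c[n])
    for m in range(len(c)):
        if m != n:
            out = out * c[m]
    return out

def sgd_pass(factors, cores, idx, vals, lr, batch, rng, update):
    """One pass over the observed entries, updating only 'factors' or 'cores'."""
    order = rng.permutation(len(vals))
    for start in range(0, len(vals), batch):
        b = order[start:start + batch]
        pred, c = predict(factors, cores, idx[b])
        err = pred - vals[b]  # gradient of 0.5 * (pred - x)^2 w.r.t. pred
        # Compute every gradient from the same snapshot before applying any.
        grads = []
        for n in range(len(factors)):
            w = others_product(c, n) * err[:, None]        # (batch, R)
            if update == "factors":
                grads.append(w @ cores[n].T)               # one row gradient per sample
            else:
                grads.append(factors[n][idx[b, n]].T @ w)  # (J_n, R), summed over batch
        for n, g in enumerate(grads):
            if update == "factors":
                # subtract.at accumulates correctly when a row index repeats in the batch
                np.subtract.at(factors[n], idx[b, n], lr * g)
            else:
                cores[n] -= lr * g

def rmse(factors, cores, idx, vals):
    pred, _ = predict(factors, cores, idx)
    return float(np.sqrt(np.mean((pred - vals) ** 2)))

def demo(seed=0, epochs=30, lr=0.005, batch=64):
    """Fit a small synthetic sparse tensor; return RMSE before and after training."""
    rng = np.random.default_rng(seed)
    dims, ranks, R = (20, 15, 10), (3, 3, 3), 3  # hypothetical toy sizes

    def init():
        return ([rng.random((I, J)) for I, J in zip(dims, ranks)],
                [rng.random((J, R)) for J in ranks])

    true_factors, true_cores = init()
    # Observe 2000 distinct coordinates of the 20 x 15 x 10 tensor.
    flat = rng.choice(int(np.prod(dims)), size=2000, replace=False)
    idx = np.stack(np.unravel_index(flat, dims), axis=1)
    vals, _ = predict(true_factors, true_cores, idx)

    factors, cores = init()
    before = rmse(factors, cores, idx, vals)
    for _ in range(epochs):
        # Subproblem 1: all factor matrices, core matrices held fixed.
        sgd_pass(factors, cores, idx, vals, lr, batch, rng, "factors")
        # Subproblem 2: all core matrices, factor matrices held fixed.
        sgd_pass(factors, cores, idx, vals, lr, batch, rng, "cores")
    after = rmse(factors, cores, idx, vals)
    return before, after
```

Calling `demo()` returns the training RMSE before and after the alternating passes. Each pass touches only the observed entries, which is why this style of update suits sparse tensors; the sketch does not attempt the GPU-specific memory-access and tensor-core optimisations that the abstract attributes to cuFastTuckerPlus.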