The Tucker decomposition, an extension of the singular value decomposition to higher-order tensors, is a useful tool for the analysis and compression of large-scale scientific data. While it has been studied extensively for static datasets, relatively few works address the computation of the Tucker factorization of streaming data tensors. In this paper we propose a new streaming Tucker algorithm tailored to scientific data, specifically to the case of a data tensor whose size increases along a single streaming mode that can grow indefinitely, which is typical of time-stepping scientific applications. At any point during this growth, we seek to compute the Tucker decomposition of the data generated thus far without storing past tensor slices in memory. Our algorithm accomplishes this by starting with an initial Tucker decomposition and updating its components (the core tensor and factor matrices) with each new tensor slice as it becomes available, while satisfying a user-specified threshold on the norm of the error. We present an implementation within the TuckerMPI software framework and apply it to synthetic and combustion simulation datasets. By comparing against the standard (batch) decomposition algorithm, we show that our streaming algorithm significantly reduces memory usage. If the tensor rank stops growing along the streaming mode, the streaming algorithm also takes less computational time than the batch algorithm.
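To make the idea of updating a Tucker decomposition slice by slice concrete, the following is a minimal NumPy sketch under simplifying assumptions. It is not the algorithm proposed in the paper, nor the TuckerMPI API. The class `StreamingTucker` and the helpers `unfold` and `mode_multiply` are hypothetical names introduced here. The sketch compresses only the non-streaming modes and leaves the streaming mode uncompressed (its factor is implicitly the identity), whereas the abstract describes updating all components. Each new slice is projected onto the current factor matrices; a factor is enlarged only when the projection residual would exceed the slice's share of the error budget, so past slices never need to be revisited.

```python
import numpy as np


def unfold(x, mode):
    """Mode-`mode` unfolding: rows index the chosen mode."""
    rows = x.shape[mode]
    return np.moveaxis(x, mode, 0).reshape(rows, x.size // rows)


def mode_multiply(x, mat, mode):
    """Tensor-times-matrix product of x with mat along `mode`."""
    return np.moveaxis(np.tensordot(mat, x, axes=(1, mode)), 0, mode)


class StreamingTucker:
    """Hypothetical sketch: the last mode is the (uncompressed) streaming mode."""

    def __init__(self, slice_shape, rel_tol):
        self.rel_tol = rel_tol
        # Start with empty factor matrices (zero columns) and an empty core.
        self.factors = [np.zeros((n, 0)) for n in slice_shape]
        self.core = np.zeros((0,) * (len(slice_shape) + 1))

    def update(self, new_slice):
        d = len(self.factors)
        # Split the squared error budget of this slice evenly over the modes.
        budget = self.rel_tol**2 * np.sum(new_slice**2) / d
        y = new_slice
        for mode in range(d):
            u = self.factors[mode]
            y_mat = unfold(y, mode)
            resid = y_mat - u @ (u.T @ y_mat)
            if np.sum(resid**2) > budget:
                left, sing, _ = np.linalg.svd(resid, full_matrices=False)
                # tail[k] = residual energy left after adding k new directions
                tail = np.append(np.cumsum((sing**2)[::-1])[::-1], 0.0)
                k = int(np.argmax(tail <= budget))
                u = np.hstack([u, left[:, :k]])
                self.factors[mode] = u
            y = mode_multiply(y, u.T, mode)  # project onto the updated basis
        # Zero-pad the old core to the new ranks and append the projected slice.
        # (Reallocating on every step keeps the sketch short, not efficient.)
        core = np.zeros(y.shape + (self.core.shape[-1] + 1,))
        core[tuple(slice(0, s) for s in self.core.shape)] = self.core
        core[..., -1] = y
        self.core = core

    def reconstruct(self):
        x = self.core
        for mode, u in enumerate(self.factors):
            x = mode_multiply(x, u, mode)
        return x


# Synthetic example: a noisy low-rank tensor streamed along its last mode.
rng = np.random.default_rng(0)
a, b = rng.standard_normal((30, 3)), rng.standard_normal((40, 4))
g = rng.standard_normal((3, 4, 20))
data = mode_multiply(mode_multiply(g, a, 0), b, 1)
data += 1e-3 * rng.standard_normal(data.shape)

model = StreamingTucker(slice_shape=data.shape[:-1], rel_tol=1e-2)
for t in range(data.shape[-1]):
    model.update(data[..., t])  # only the current slice is needed

rel_err = np.linalg.norm(data - model.reconstruct()) / np.linalg.norm(data)
print("core shape:", model.core.shape, "relative error:", rel_err)
```

Because each projection is orthogonal, the squared error of a slice is at most the sum of the per-mode budgets, so the relative error over all slices seen so far stays within `rel_tol` in this sketch. Appending zero rows to the core when a factor grows leaves the approximation of earlier slices unchanged, which is what allows the past slices to be discarded.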