Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required to perform the Multi-TTM computation in parallel. The crux of the proof is the analytical solution of a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that with an appropriate choice of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensor-times-matrix operations.
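To make the two formulations concrete, the following is a minimal sequential sketch for a 3-way tensor, not the parallel algorithm described above. It assumes the convention Y(r1, r2, r3) = sum over i1, i2, i3 of X(i1, i2, i3) A1(i1, r1) A2(i2, r2) A3(i3, r3), with tensors stored as nested lists; the function names `multi_ttm_direct` and `multi_ttm_sequence` are hypothetical and chosen only for illustration. The sketch shows that the all-at-once form and the sequence-of-TTMs form produce the same result; it does not model processors or communication.

```python
from itertools import product


def multi_ttm_direct(X, A1, A2, A3):
    """All-at-once Multi-TTM for a 3-way tensor stored as nested lists.

    Assumed convention (illustrative):
    Y[a][b][c] = sum_{i,j,k} X[i][j][k] * A1[i][a] * A2[j][b] * A3[k][c]
    """
    n1, n2, n3 = len(X), len(X[0]), len(X[0][0])
    r1, r2, r3 = len(A1[0]), len(A2[0]), len(A3[0])
    Y = [[[0] * r3 for _ in range(r2)] for _ in range(r1)]
    for a, b, c in product(range(r1), range(r2), range(r3)):
        Y[a][b][c] = sum(
            X[i][j][k] * A1[i][a] * A2[j][b] * A3[k][c]
            for i, j, k in product(range(n1), range(n2), range(n3))
        )
    return Y


def multi_ttm_sequence(X, A1, A2, A3):
    """Same computation expressed as three single-mode TTMs in sequence."""
    n1, n2, n3 = len(X), len(X[0]), len(X[0][0])
    r1, r2, r3 = len(A1[0]), len(A2[0]), len(A3[0])
    # Mode-1 TTM: contract index i with A1, giving an r1 x n2 x n3 tensor.
    T1 = [[[sum(X[i][j][k] * A1[i][a] for i in range(n1))
            for k in range(n3)] for j in range(n2)] for a in range(r1)]
    # Mode-2 TTM: contract index j with A2, giving an r1 x r2 x n3 tensor.
    T2 = [[[sum(T1[a][j][k] * A2[j][b] for j in range(n2))
            for k in range(n3)] for b in range(r2)] for a in range(r1)]
    # Mode-3 TTM: contract index k with A3, giving an r1 x r2 x r3 tensor.
    return [[[sum(T2[a][b][k] * A3[k][c] for k in range(n3))
              for c in range(r3)] for b in range(r2)] for a in range(r1)]


if __name__ == "__main__":
    # Small integer example: a 2 x 2 x 2 tensor and three 2 x 1 matrices.
    X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    A1 = [[1], [2]]
    A2 = [[1], [0]]
    A3 = [[0], [1]]
    print(multi_ttm_direct(X, A1, A2, A3))
    print(multi_ttm_sequence(X, A1, A2, A3))
```

In the sequential setting both functions return the same tensor. They differ in the intermediate tensors `T1` and `T2` that the sequence approach materializes, which is the kind of intermediate data a parallel sequence-of-TTMs approach would have to distribute.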