The tensor-vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that remains oblivious to contraction mode, tensor splitting, and tensor order. In addition, we propose a novel distributed HOPM, namely dHOPM3, which can save up to one order of magnitude of streamed memory and costs only about twice as much in data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multi-core processors and accelerators confirm that the performance of dTVC and dHOPM3 remains relatively close to the peak system memory bandwidth (50%-80%, depending on the architecture) and on par with STREAM benchmark figures. In strong-scaling scenarios, our native multi-core implementations of these two algorithms can achieve performance similar to, and sometimes even greater than, implementations based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed-precision arithmetic, even when the hardware does not natively support low-precision data types.
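As a concrete illustration of the two operations named above, a mode-n TVC and the HOPM it underpins can be sketched in plain NumPy. This is only a minimal serial sketch, not the distributed algorithm the paper describes; the function names `tvc` and `hopm`, the random initialization, and the fixed iteration count are our assumptions for the example.

```python
import numpy as np

def tvc(tensor, vector, mode):
    # Mode-`mode` tensor-vector contraction: sums the tensor against
    # the vector along one mode, reducing the tensor order by one.
    return np.tensordot(tensor, vector, axes=([mode], [0]))

def hopm(T, iters=50):
    """Minimal higher-order power method for a best rank-1 approximation
    of a dense order-d tensor (illustrative sketch, not the paper's dHOPM3).
    Each sweep updates factor n by contracting T along every mode except n."""
    d = T.ndim
    rng = np.random.default_rng(0)
    u = [rng.standard_normal(s) for s in T.shape]
    u = [x / np.linalg.norm(x) for x in u]
    lam = 0.0
    for _ in range(iters):
        for n in range(d):
            y = T
            # Contract modes from highest to lowest so that each remaining
            # mode m is still located at axis m when its turn comes.
            for m in reversed(range(d)):
                if m != n:
                    y = tvc(y, u[m], m)
            lam = np.linalg.norm(y)
            u[n] = y / lam
    return lam, u

# Rank-1 test tensor T[i,j,k] = a[i] * b[j] * c[k]; HOPM should recover
# lam = ||a|| * ||b|| * ||c|| exactly for this case.
a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
T = np.einsum('i,j,k->ijk', a, b, c)
lam, factors = hopm(T)
```

Note that a sequence of d-1 such TVCs streams the full tensor once per factor update, which is why fusing or reordering these contractions (as dHOPM3 does in distributed form) pays off in streamed-memory traffic.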