The tensor-vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that remains oblivious to contraction mode, tensor splitting, and tensor order. In addition, we propose a novel distributed HOPM, namely dHOPM3, that can save up to one order of magnitude of streamed memory and is about twice as costly in terms of data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multi-core and accelerated systems confirm that the performance of dTVC and dHOPM3 remains relatively close to the peak system memory bandwidth (50%-80%, depending on the architecture) and on par with STREAM reference values. In strong-scalability scenarios, our native multi-core implementations of these two algorithms achieve performance comparable to, and sometimes exceeding, that of implementations based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed-precision arithmetic, even in cases where the hardware does not support low-precision data types natively.
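To fix ideas, the two operations named in the abstract can be sketched serially in NumPy. This is a minimal illustrative sketch only: the helper names `tvc` and `hopm`, the rank-1 (single-vector) HOPM variant, and the fixed iteration count are assumptions for exposition, not the paper's distributed dTVC/dHOPM3 algorithms.

```python
import numpy as np

def tvc(tensor, vector, mode):
    """Mode-`mode` tensor-vector contraction (TVC): contracts `vector`
    along the given mode of `tensor`, reducing the tensor order by one."""
    return np.tensordot(tensor, vector, axes=([mode], [0]))

def hopm(tensor, iters=50, seed=0):
    """Serial higher-order power method (HOPM) for a rank-1 approximation.
    Returns one normalized factor vector per tensor mode."""
    rng = np.random.default_rng(seed)
    us = [rng.standard_normal(n) for n in tensor.shape]
    us = [u / np.linalg.norm(u) for u in us]
    for _ in range(iters):
        for n in range(tensor.ndim):
            y = tensor
            # Contract every mode except n, highest mode first, so the
            # remaining axis indices stay valid after each contraction.
            for m in reversed(range(tensor.ndim)):
                if m != n:
                    y = tvc(y, us[m], m)
            us[n] = y / np.linalg.norm(y)
    return us

# A mode-1 TVC on a 3rd-order tensor: the contracted mode disappears.
X = np.arange(24.0).reshape(2, 3, 4)
Y = tvc(X, np.ones(3), mode=1)   # shape (2, 4)

# On an exactly rank-1 tensor, HOPM recovers the factor directions.
a, b, c = np.array([3.0, 4.0]), np.ones(3), np.arange(1.0, 5.0)
T = np.einsum('i,j,k->ijk', a, b, c)
u0, u1, u2 = hopm(T)
```

The inner loop of `hopm` performs one TVC per mode of the (shrinking) intermediate tensor, which is why the HOPM is dominated by the memory traffic of its TVC calls; the paper's dHOPM3 reduces exactly this streamed-memory cost in the distributed setting.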