We propose an algorithm that aims to minimize the inter-node communication volume of distributed and memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize the intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves significantly improved performance for matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes compared to conventional implementations that do not use node-aware processor grids. Our implementation shows good performance when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to discussing the performance of matrix-matrix multiplication, we investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm also improves the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.