Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Many algorithms require repeating these products multiple times with the same matrix. This so-called matrix power kernel (MPK) provides an opportunity for data reuse, since the same matrix data is loaded from main memory multiple times. This opportunity has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph-based formulation of the SpMV and employs a level-based implementation of SpMV to reuse relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared-memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed-memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.
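To make the data-reuse argument concrete, the following is a minimal sketch of a "traditional" MPK, i.e. back-to-back SpMVs with no cache blocking. It is not RACE and not the implementation described in this work. The CSR layout and the function names `spmv_csr` and `naive_mpk` are assumptions chosen for illustration only.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def naive_mpk(row_ptr, col_idx, values, x, power):
    """Return [x, A x, A^2 x, ..., A^power x] using one full SpMV per power.

    Every iteration sweeps over the whole matrix again, so for matrices
    larger than the cache the matrix data is re-read from main memory
    `power` times. This is the traffic that cache blocking aims to reduce.
    """
    vectors = [list(x)]
    for _ in range(power):
        vectors.append(spmv_csr(row_ptr, col_idx, values, vectors[-1]))
    return vectors


# Hypothetical 3x3 tridiagonal example: A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
result = naive_mpk(row_ptr, col_idx, values, [1.0, 1.0, 1.0], power=2)
```

Here `result[1]` holds A x and `result[2]` holds A^2 x. In a distributed-memory setting, each process would additionally have to exchange remote vector entries (for example via MPI) before every SpMV. That exchange is omitted from this single-process sketch.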
