Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches have been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is now common practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow the polynomial to be evaluated using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently, we have shown that, using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework, we can perform temporal cache blocking of the MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms, we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain is significantly reduced, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers.
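To make the MPK operation concrete, the following is a minimal sketch of the baseline it refers to: computing x, Ax, ..., A^p x by p back-to-back sparse matrix-vector products on a CSR matrix, and then combining those vectors with fixed polynomial coefficients. It is not the RACE implementation and performs no cache blocking; the function names (`spmv`, `matrix_power_kernel`, `eval_polynomial`) and the plain-list CSR layout are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
         vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """Return y = A @ x for a sparse matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def matrix_power_kernel(row_ptr: Sequence[int], col_idx: Sequence[int],
                        vals: Sequence[float], x: Sequence[float],
                        p: int) -> List[List[float]]:
    """Return [x, A x, A^2 x, ..., A^p x] using p consecutive SpMVs.

    Each sweep streams the whole matrix again; this repeated traffic is
    what a temporal cache-blocking scheme would aim to reduce.
    """
    basis = [[float(v) for v in x]]
    for _ in range(p):
        basis.append(spmv(row_ptr, col_idx, vals, basis[-1]))
    return basis


def eval_polynomial(basis: Sequence[Sequence[float]],
                    coeffs: Sequence[float]) -> List[float]:
    """Return sum_k coeffs[k] * (A^k x) from precomputed power vectors."""
    out = [0.0] * len(basis[0])
    for c, v in zip(coeffs, basis):
        for i, vi in enumerate(v):
            out[i] += c * vi
    return out
```

With fixed coefficients, a polynomial preconditioner application reduces to one call of `matrix_power_kernel` followed by `eval_polynomial`, which is why the MPK dominates the runtime of such algorithms.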