Exploiting mesh structure to improve multigrid performance for saddle point problems

In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on block factorization and those based on monolithic multigrid. Both classes require several critical choices in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. Existing studies give some insight into which options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures such as GPUs. In this paper, we compare a block-triangular preconditioner with a monolithic multigrid method using the three most common choices of relaxation scheme: Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform block-triangular preconditioning, and that Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to $100\%$ slower than Braess-Sarazin), it outperforms Braess-Sarazin by more than $20\%$ on the GPU, making it a competitive algorithm, especially given the substantial algorithmic tuning needed for effective Braess-Sarazin relaxation.
