Incomplete LU (ILU) smoothers are effective in the algebraic multigrid (AMG) $V$-cycle for reducing high-frequency components of the error. However, the requisite direct triangular solves are comparatively slow on GPUs. Previous work has demonstrated the advantages of Jacobi iteration as an alternative to direct solution of these triangular systems. Depending on the threshold and fill-level parameters chosen, however, the factors can be highly non-normal, and Jacobi is then unlikely to converge in a small number of iterations. We demonstrate that row scaling can reduce the departure from normality, allowing us to replace the inherently sequential solve with a rapidly converging Richardson iteration. This approach has several advantages beyond lower compute time. Because the scaling is applied directly to the factor, it is performed locally on a diagonal block of the global matrix. Further, an ILUT Schur complement smoother maintains a constant GMRES iteration count as the number of MPI ranks increases, and thus parallel strong scaling is improved. Our algorithms have been incorporated into hypre, and we demonstrate improved time to solution for linear systems arising in the Nalu-Wind and PeleLM pressure solvers. For large problem sizes, GMRES$+$AMG executes at least five times faster when using iterative triangular solves compared with direct solves on massively parallel GPUs.
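The replacement of a sequential triangular solve by a Richardson iteration on a row-scaled factor can be illustrated with a small sketch. This is not the hypre implementation: it assumes dense list-of-lists storage, uses scaling by the factor's own diagonal as one simple choice of row scaling, and the function names `row_scale`, `richardson_solve`, and `forward_substitution` are hypothetical. With a unit diagonal after scaling, the Jacobi and Richardson updates coincide, and each sweep is a matrix-vector product whose rows can be computed independently.

```python
def row_scale(T, b):
    """Divide each row of the triangular matrix T, and the matching entry
    of b, by the diagonal entry so the scaled factor has a unit diagonal.
    (Assumed scaling choice for this sketch.)"""
    n = len(b)
    Ts = [[T[i][j] / T[i][i] for j in range(n)] for i in range(n)]
    bs = [b[i] / T[i][i] for i in range(n)]
    return Ts, bs


def richardson_solve(T, b, num_iters):
    """Approximate the solution of T x = b with the Richardson update
    x <- x + (b_s - T_s x) applied to the row-scaled system."""
    Ts, bs = row_scale(T, b)
    n = len(bs)
    x = bs[:]  # initial guess: the scaled right-hand side
    for _ in range(num_iters):
        # Residual of the scaled system; no row depends on another row's
        # updated value, unlike forward or backward substitution.
        r = [bs[i] - sum(Ts[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + r[i] for i in range(n)]
    return x


def forward_substitution(L, b):
    """Reference direct solve for a lower triangular L (inherently sequential:
    row i needs the already computed entries x[0..i-1])."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 4.0, 0.0],
         [-1.0, 2.0, 5.0]]
    b = [2.0, 6.0, 10.0]
    print("direct:    ", forward_substitution(L, b))
    print("richardson:", richardson_solve(L, b, num_iters=2))
```

Because the off-diagonal part of the scaled factor is strictly triangular, and hence nilpotent, the iteration reproduces the direct solution in exact arithmetic after at most $n$ sweeps for an $n \times n$ factor. The practical question is how few sweeps are enough, and that is where the departure from normality of the factor matters.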