Incomplete LU (ILU) smoothers are effective in the algebraic multigrid (AMG) $V$-cycle for reducing high-frequency components of the error. However, the requisite direct triangular solves are comparatively slow on GPUs. Previous work by Anzt et al. (2015) demonstrated the advantages of Jacobi iteration as an alternative to direct solution of these systems. Depending on the threshold and fill-level parameters chosen, the factors can be highly non-normal, in which case Jacobi is unlikely to converge in a small number of iterations. We demonstrate that row scaling can reduce the departure from normality, allowing us to replace the inherently sequential solve with a rapidly converging Richardson iteration. There are several advantages beyond the lower compute time. Because scaling is applied directly to the factor, it is performed locally on a diagonal block of the global matrix. Further, an ILUT Schur complement smoother maintains a constant GMRES iteration count as the number of MPI ranks increases, which improves parallel strong scaling. Our algorithms have been incorporated into hypre, and we demonstrate improved time to solution for the Nalu-Wind and PeleLM pressure solvers. For large problem sizes, GMRES$+$AMG executes at least five times faster with iterative triangular solves than with direct solves on massively parallel GPUs.
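The idea of replacing a sequential triangular solve with a Richardson iteration on a row-scaled factor can be sketched as follows. This is a minimal illustration under stated assumptions, not the hypre implementation: the function names, the test matrix, and the choice of scaling each row by its diagonal entry are all hypothetical, and departure from normality is measured here with Henrici's Frobenius-norm measure, which for a triangular matrix reduces to the norm of its strictly triangular part.

```python
import math
import random


def departure_from_normality(T):
    """Henrici's measure (Frobenius norm). For a triangular matrix the
    eigenvalues are the diagonal entries, so the measure equals the
    Frobenius norm of the off-diagonal part."""
    n = len(T)
    return math.sqrt(sum(T[i][j] ** 2
                         for i in range(n) for j in range(n) if i != j))


def row_scale(T, b):
    """Divide each row of T and entry of b by the row's diagonal entry,
    giving a unit-diagonal factor (an illustrative scaling choice)."""
    scaled_T = [[t_ij / row[i] for t_ij in row] for i, row in enumerate(T)]
    scaled_b = [b_i / T[i][i] for i, b_i in enumerate(b)]
    return scaled_T, scaled_b


def matvec(T, x):
    """Dense matrix-vector product; every row is independent work."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in T]


def richardson_triangular_solve(T, b, num_iters):
    """Approximate T x = b with x <- x + (b - T x), starting from zero."""
    x = [0.0] * len(b)
    for _ in range(num_iters):
        residual = [b_i - y_i for b_i, y_i in zip(b, matvec(T, x))]
        x = [x_i + r_i for x_i, r_i in zip(x, residual)]
    return x


def back_substitution(U, b):
    """Sequential direct solve for an upper triangular U (reference)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / U[i][i]
    return x


# Hypothetical test factor: upper triangular with a dominant diagonal.
random.seed(0)
n = 8
U = [[8.0 if i == j else (random.uniform(-1.0, 1.0) if j > i else 0.0)
      for j in range(n)] for i in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

U_scaled, b_scaled = row_scale(U, b)
x_direct = back_substitution(U, b)

print("departure from normality:",
      departure_from_normality(U), "->", departure_from_normality(U_scaled))
for k in (1, 2, 4, n):
    x_k = richardson_triangular_solve(U_scaled, b_scaled, k)
    err = max(abs(p - q) for p, q in zip(x_k, x_direct))
    print(f"iterations = {k}, max error = {err:.3e}")
```

Each Richardson sweep is a matrix-vector product, which parallelizes across rows, whereas back substitution must process rows in order. With a unit diagonal the iteration matrix is strictly triangular and therefore nilpotent, so the iteration reproduces the direct solution after at most `n` sweeps in exact arithmetic; the practical question is whether far fewer sweeps suffice. In this example the scaling shrinks the off-diagonal part because every diagonal entry exceeds one in magnitude, which does not hold for arbitrary factors.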