Iterative memory-bound solvers are common in HPC codes. Typical GPU implementations use a host-side loop that launches the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution at every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate their effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x on smaller SpMV datasets from SuiteSparse and $1.43$x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
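The execution model described above can be illustrated with a small CPU-side analogue. This is a minimal sketch under stated assumptions, not the authors' CUDA implementation: Python threads stand in for thread blocks, `threading.Barrier` stands in for the device-wide barrier, and a per-thread list stands in for data cached in registers and shared memory. The function names (`host_loop`, `persistent_cached`), the 3-point averaging stencil, and the `edges` exchange buffer are all hypothetical choices made for illustration.

```python
import threading

def host_loop(u0, steps):
    """Baseline: one 'kernel launch' per step; returning from it is the barrier."""
    a, b = list(u0), list(u0)
    n = len(a)
    for _ in range(steps):
        for i in range(1, n - 1):          # boundary cells stay fixed
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a, b = b, a                        # all of 'a' round-trips through "device memory"
    return a

def persistent_cached(u0, steps, workers=4):
    """Persistent analogue: the time loop runs inside each worker.

    Assumes len(u0) is divisible by workers (a simplification for this sketch).
    """
    n = len(u0)
    assert n % workers == 0 and n >= workers
    chunk = n // workers
    # Double-buffered (first, last) cell of every chunk: the only per-step shared traffic.
    edges = [[(0.0, 0.0)] * workers for _ in range(2)]
    results = [None] * workers
    barrier = threading.Barrier(workers)   # stand-in for a device-wide barrier

    def worker(w):
        lo = w * chunk
        local = list(u0[lo:lo + chunk])    # stand-in for registers/shared memory
        edges[0][w] = (local[0], local[-1])
        barrier.wait()                     # initial edges are visible to everyone
        for t in range(steps):
            cur, nxt = edges[t % 2], edges[(t + 1) % 2]
            left = cur[w - 1][1] if w > 0 else None
            right = cur[w + 1][0] if w < workers - 1 else None
            new = local[:]
            for j in range(chunk):
                i = lo + j
                if i == 0 or i == n - 1:   # global boundary stays fixed
                    continue
                l = local[j - 1] if j > 0 else left
                r = local[j + 1] if j < chunk - 1 else right
                new[j] = (l + local[j] + r) / 3.0
            local = new                    # chunk never leaves the "cache"
            nxt[w] = (local[0], local[-1])
            barrier.wait()                 # replaces kernel termination as the sync point
        results[w] = local                 # single write-back after the last step

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [x for part in results for x in part]
```

In this sketch one barrier per step is enough because the edge buffer is double-buffered: a worker can only overwrite the buffer read in step `t` after every worker has passed the barrier that ends step `t`. The per-step shared traffic shrinks from the whole array to two values per worker, which is the effect the caching scheme aims for on the GPU.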