Iterative memory-bound solvers are common in HPC codes. Typical GPU implementations run a loop on the host that launches the GPU kernel once per time step or algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution by one time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of each time step's output in otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate its effectiveness on a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries) and on a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x on smaller SpMV datasets from SuiteSparse and $1.43$x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
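The execution model described in the abstract can be illustrated with a minimal CUDA sketch. This is not the authors' implementation (see the repository for that). It is a toy 1D 3-point Jacobi stencil written under several assumptions: the kernel name `perks_jacobi1d`, the helpers `run_perks` and `jacobi1d_reference`, the `edges` halo buffer, and the clamped boundary condition are all hypothetical, and the whole domain is assumed small enough to stay cached on chip (one cell per thread), whereas PERKS caches only a subset. The time loop lives inside the kernel, `grid.sync()` from cooperative groups serves as the device-wide barrier, and only the two edge values of each block travel through device memory per step.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical persistent kernel: u_new[i] = (u[i-1] + u[i] + u[i+1]) / 3.
// Assumes the domain size equals gridDim.x * blockDim.x (one cell per thread).
// `edges` holds two buffers of 2 * gridDim.x floats (double-buffered halos).
__global__ void perks_jacobi1d(float* u, float* edges, int steps) {
    extern __shared__ float tile[];              // blockDim.x + 2 floats
    cg::grid_group grid = cg::this_grid();
    const int t = threadIdx.x, b = blockIdx.x;
    const int last = blockDim.x - 1, nb = gridDim.x;
    const int i = b * blockDim.x + t;

    float v = u[i];                              // read device memory once
    for (int s = 0; s < steps; ++s) {            // time loop inside the kernel
        float* e = edges + (s & 1) * 2 * nb;     // halo buffer for this step
        tile[t + 1] = v;                         // interior stays on chip
        if (t == 0)    e[2 * b]     = v;         // publish left edge
        if (t == last) e[2 * b + 1] = v;         // publish right edge
        grid.sync();                             // device-wide barrier
        // Fetch neighbor edges; clamp at the domain boundary (assumption).
        if (t == 0)    tile[0]        = (b > 0)      ? e[2 * (b - 1) + 1] : v;
        if (t == last) tile[last + 2] = (b < nb - 1) ? e[2 * (b + 1)]     : v;
        __syncthreads();
        v = (tile[t] + tile[t + 1] + tile[t + 2]) / 3.0f;
        __syncthreads();                         // tile is rewritten next step
    }
    u[i] = v;                                    // write device memory once
}

// CPU reference with the same clamped boundary, for checking the sketch.
std::vector<float> jacobi1d_reference(std::vector<float> u, int steps) {
    const int n = static_cast<int>(u.size());
    std::vector<float> next(n);
    for (int s = 0; s < steps; ++s) {
        for (int i = 0; i < n; ++i) {
            float l = u[i > 0 ? i - 1 : 0], r = u[i < n - 1 ? i + 1 : n - 1];
            next[i] = (l + u[i] + r) / 3.0f;
        }
        u.swap(next);
    }
    return u;
}

// Runs `steps` sweeps on the GPU; returns an empty vector on any failure.
std::vector<float> run_perks(std::vector<float> h, int steps, int threads = 128) {
    const int n = static_cast<int>(h.size());
    if (n == 0 || n % threads != 0) return {};
    const int blocks = n / threads;
    const size_t smem = (threads + 2) * sizeof(float);

    int dev = 0, coop = 0, perSM = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess) return {};
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return {};
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSM, perks_jacobi1d,
                                                  threads, smem);
    // A grid-wide barrier requires every block to be resident at once.
    if (!coop || blocks > perSM * prop.multiProcessorCount) return {};

    float *u = nullptr, *edges = nullptr;
    if (cudaMalloc(&u, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&edges, 4 * blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(u);
        return {};
    }
    cudaMemcpy(u, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    void* args[] = {&u, &edges, &steps};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)perks_jacobi1d, dim3(blocks), dim3(threads), args, smem, 0);
    if (err == cudaSuccess)
        err = cudaMemcpy(h.data(), u, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(u);
    cudaFree(edges);
    return err == cudaSuccess ? h : std::vector<float>{};
}
```

In a conventional host-side loop, every step would read and write the full array in device memory. Here each step moves only two floats per block, at the cost of requiring all blocks to be co-resident, which is why the launch goes through `cudaLaunchCooperativeKernel` after an occupancy check.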