Stencil computation is a widely used class of scientific-computing applications that graphics processing units (GPUs) can accelerate efficiently. Out-of-core approaches enable a GPU to handle large stencil codes whose data size exceeds the GPU's memory capacity. However, current research on out-of-core stencil computation primarily focuses on minimizing the amount of data transferred between the CPU and GPU; few studies consider optimizing data transfer and kernel execution simultaneously. To fill this gap, this work presents SO2DR, a synergy between on- and off-chip data reuse for out-of-core stencil codes. First, overlapping regions between data chunks are shared in off-chip memory to eliminate redundant CPU-GPU data transfer. Second, redundant computation is intentionally introduced at the off-chip memory level to decouple kernel execution from region sharing, which enables data reuse in on-chip memory. Experimental results demonstrate that SO2DR significantly improves kernel-execution performance while reducing CPU-GPU data-transfer time. Specifically, across five stencil benchmarks, SO2DR achieves average speedups of 2.78x over an out-of-core stencil code that is free of redundant transfer and computation, and 1.14x over an in-core stencil code that is free of data transfer.
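The two ideas in the abstract can be illustrated with a toy model. What follows is a minimal sketch in plain Python, not the authors' implementation: the abstract gives no implementation details, so the 1D 3-point averaging stencil, the fixed boundary cells, and the names `stencil_step`, `in_core`, `out_of_core`, `chunk`, and `k` are all assumptions made for illustration. Host and device memory are simulated with lists. Each chunk is loaded with a halo of `k` cells per side, the part of that halo already resident from the previous chunk is reused instead of being transferred again (region sharing), and `k` time steps are then run on the chunk in one visit, recomputing halo cells that the neighboring chunk also computes (redundant computation). In a real GPU kernel, fusing those `k` steps is what would let intermediate values stay in on-chip memory.

```python
def stencil_step(buf):
    """One 3-point averaging step; the two end cells are held fixed."""
    out = buf[:]
    for i in range(1, len(buf) - 1):
        out[i] = (buf[i - 1] + buf[i] + buf[i + 1]) / 3.0
    return out


def in_core(grid, steps):
    """Reference: the whole grid fits in (simulated) device memory."""
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


def out_of_core(grid, steps, chunk, k):
    """Process `grid` chunk by chunk, fusing k time steps per chunk visit.

    Returns (result, transferred), where `transferred` counts the cells
    copied from host to device over the whole run.
    """
    assert steps % k == 0, "sketch assumes steps is a multiple of k"
    n, halo = len(grid), k  # stencil radius is 1, so k steps need k halo cells
    transferred = 0
    for _ in range(steps // k):
        new_grid = grid[:]
        prev, prev_lo, prev_hi = [], 0, 0  # previous chunk's device buffer
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            lo, hi = max(start - halo, 0), min(end + halo, n)
            # Region sharing: cells [lo, prev_hi) are already on the device,
            # so only cells [first_new, hi) are transferred from the host.
            first_new = max(lo, prev_hi)
            buf = prev[lo - prev_lo:] + grid[first_new:hi]
            transferred += hi - first_new
            prev, prev_lo, prev_hi = buf, lo, hi
            # Redundant computation: run k steps on the chunk plus its halo.
            # Halo cells go stale one layer per step, but cells [start, end)
            # are still exact after k steps.
            work = buf
            for _ in range(k):
                work = stencil_step(work)
            new_grid[start:end] = work[start - lo:end - lo]
        grid = new_grid
    return grid, transferred
```

Under these assumptions, `out_of_core(grid, steps, chunk, k)` returns the same values as `in_core(grid, steps)`, and every cell crosses the simulated host-device boundary once per round of `k` steps, which is the redundant transfer that region sharing removes. The cost is the extra arithmetic on halo cells, which grows with `k`.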