It is well known that, to accelerate stencil codes on CPUs or GPUs and to exploit hardware caches and their cache lines, optimizers must find spatial and temporal locality in array accesses in order to harvest data-reuse opportunities. FPGAs carry the burden that they have no built-in caches (or only pre-built hardware descriptions of cache blocks that are inefficient for stencil codes). This paper demonstrates that this lack is also an opportunity: polyhedral methods can be used to generate stencil-specific cache structures of the right sizes on the FPGA and to fill and flush them efficiently with wide bursts during stencil execution. The paper shows how to derive the appropriate directives and code restructurings from stencil codes so that the FPGA compiler generates fast stencil hardware. Switching on our optimization reduces the runtime of a set of 10 stencils by a factor of between 43 and 156.
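To make the idea of a stencil-specific cache structure concrete, the following is a minimal sketch in plain C and not the code generated by the paper's method. It assumes a hypothetical 1D 3-point averaging stencil; the names `stencil_naive`, `stencil_buffered`, `RADIUS`, and `TILE` are illustrative choices. The local buffer is sized from the stencil footprint (tile length plus a halo of `RADIUS` on each side), and each `memcpy` stands in for one wide burst between off-chip memory and an on-chip buffer. Here the footprint is written by hand; the paper derives such sizes with polyhedral methods.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RADIUS 1   /* reach of the assumed 3-point stencil on each side */
#define TILE   64  /* outputs computed per tile (hypothetical choice)   */

/* Reference version: every operand is read directly from the large array. */
void stencil_naive(const float *in, float *out, size_t n)
{
    for (size_t i = RADIUS; i + RADIUS < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Buffered version: fill a small local buffer with one contiguous copy,
 * compute the tile from that buffer only, then flush the results with
 * one contiguous copy. Boundary elements of out are left untouched. */
void stencil_buffered(const float *in, float *out, size_t n)
{
    float in_buf[TILE + 2 * RADIUS];  /* tile plus halo: the stencil footprint */
    float out_buf[TILE];

    assert(in != out);  /* the sketch assumes separate input and output arrays */

    for (size_t base = RADIUS; base + RADIUS < n; base += TILE) {
        size_t remaining = n - RADIUS - base;       /* outputs still to compute */
        size_t len = remaining < TILE ? remaining : TILE;

        /* "Burst" fill: len outputs need len + 2*RADIUS consecutive inputs. */
        memcpy(in_buf, in + base - RADIUS, (len + 2 * RADIUS) * sizeof(float));

        /* All reuse of neighbouring elements is served from the local buffer. */
        for (size_t j = 0; j < len; ++j)
            out_buf[j] = (in_buf[j] + in_buf[j + 1] + in_buf[j + 2]) / 3.0f;

        /* "Burst" flush of the finished tile. */
        memcpy(out + base, out_buf, len * sizeof(float));
    }
}
```

Calling `stencil_buffered(in, out, n)` produces the same interior values as `stencil_naive(in, out, n)`. In the buffered version each input element is fetched from the large array once per tile (plus halo overlap) instead of three times, which is the kind of data reuse the abstract refers to.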