Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data sets. A large body of work focuses on reducing off-chip transfers, but few authors try to improve the efficiency of the transfers themselves. This paper addresses the latter issue by proposing (i) a compiler-based approach to accelerator data layout that maximizes contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can reduce I/O cycles by up to $7\times$ compared to unoptimized memory accesses.
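To illustrate why layout matters for transfer efficiency, the following minimal sketch counts how many contiguous bursts are needed to fetch one square tile of a 2D array under two layouts. It is an illustration under stated assumptions, not the paper's compiler-derived layout: the function names, the square tiling, and the 64-column array are all hypothetical, and a burst is simply modeled as a maximal run of consecutive addresses.

```python
def count_bursts(addresses):
    """Count maximal runs of consecutive addresses (one run = one burst)."""
    bursts = 0
    prev = None
    for a in addresses:
        if prev is None or a != prev + 1:
            bursts += 1
        prev = a
    return bursts


def row_major_tile_addresses(n_cols, tile, ti, tj):
    """Addresses of tile (ti, tj) when the whole array is stored row-major."""
    return [(ti * tile + i) * n_cols + (tj * tile + j)
            for i in range(tile) for j in range(tile)]


def tiled_layout_addresses(n_cols, tile, ti, tj):
    """Addresses of the same tile when each tile is stored contiguously.

    Assumes n_cols is a multiple of tile (hypothetical simplification).
    """
    tiles_per_row = n_cols // tile
    base = (ti * tiles_per_row + tj) * tile * tile
    return [base + k for k in range(tile * tile)]


# Hypothetical sizes: 64-column array, 8x8 tiles, tile at position (1, 2).
N_COLS, TILE = 64, 8
row_major_bursts = count_bursts(row_major_tile_addresses(N_COLS, TILE, 1, 2))
tiled_bursts = count_bursts(tiled_layout_addresses(N_COLS, TILE, 1, 2))
print(row_major_bursts, tiled_bursts)
```

In this toy model the row-major layout needs one burst per tile row, while the tile-contiguous layout needs a single burst. Fewer, longer bursts are the kind of contiguity that a layout transformation aims for.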