The BRAM is the Limit: Shattering Myths, Shaping Standards, and Building Scalable PIM Accelerators

Many recent FPGA-based Processor-in-Memory (PIM) architectures promise impressive levels of parallelism, but their performance falls short of expectations because of reduced maximum clock frequencies, an inability to scale processing elements up to the maximum BRAM capacity, and minimal hardware support for large reduction operations. In this paper, we first establish what we believe should be a "Gold Standard" set of design objectives for PIM-based FPGA designs. This Gold Standard is intended to serve both as an absolute metric for comparing PIMs developed on different technology nodes and vendor families and as an aspirational goal for designers. We then present IMAGine, an In-Memory Accelerated GEMV engine, as a case study showing that the Gold Standard can be realized in practice. IMAGine serves as an existence proof that dispels several myths about what are commonly accepted as limits on FPGA clocking and scaling. Specifically, IMAGine clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses show that IMAGine executes faster than existing PIM-based GEMV engines on FPGAs and achieves a 2.65x - 3.2x faster clock. An AMD Alveo U55 implementation achieves a system clock speed of 737 MHz and provides 64K bit-serial multiply-accumulate (MAC) units for GEMV operations. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.
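
For readers unfamiliar with the term, the following is a minimal software sketch of what a bit-serial MAC computes when used for GEMV. It is not IMAGine's datapath, which the abstract does not describe. The function names, the 8-bit unsigned operand width, and the mapping of one matrix row to one accumulator are all assumptions made purely for illustration.

```python
def bit_serial_mul(a: int, b: int, bits: int = 8) -> int:
    """Multiply unsigned a by unsigned b, consuming one bit of b per step.

    Hypothetical helper: models a shift-and-add multiplier that looks at a
    single multiplier bit per "cycle" instead of using a parallel multiplier.
    """
    assert 0 <= a < (1 << bits) and 0 <= b < (1 << bits)
    acc = 0
    for i in range(bits):        # one step per multiplier bit, LSB first
        if (b >> i) & 1:
            acc += a << i        # add the multiplicand weighted by 2**i
    return acc


def gemv_bit_serial(matrix, vector, bits: int = 8):
    """Compute y = matrix @ vector using bit_serial_mul for every product.

    Assumption: each row's accumulator stands in for one processing element;
    in hardware the rows would be processed in parallel, here they are looped.
    """
    result = []
    for row in matrix:
        acc = 0
        for w, x in zip(row, vector):
            acc += bit_serial_mul(w, x, bits)   # multiply-accumulate
        result.append(acc)
    return result


W = [[1, 2, 3], [4, 5, 6]]
x = [7, 8, 9]
print(gemv_bit_serial(W, x))  # [50, 122]
```

The sketch shows only the arithmetic: each product takes as many steps as the operand has bits, which is why bit-serial designs rely on a large number of simple units and a fast clock to reach competitive throughput.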
