Overlays are an effective approach to building FPGA-based AI accelerators, enabling software-programmable specialized hardware datapaths that flexibly support various DNN operations. Traditional DNN overlays typically base their instruction sets on the von Neumann model, adapted to a coarser granularity. These instruction sets control execution at layer granularity and impose restrictive patterns for mapping computation and bandwidth resources. Such constraints cause inefficiencies due to the imperfect match between the supported execution patterns and the diverse shapes and types of DNN layers. This work proposes the Reconfigurable Stream Network architecture, an ISA abstraction tailored for flexible, low-cost FPGA overlay execution, making it the first known FPGA design to support dynamic pipelining of sequential linear layers. This architecture presents a datapath abstraction modeled as a specialized circuit-switched network with stateful functional units (FUs) as nodes and data streaming on the edges. Programming a computation corresponds to triggering a network path in this stream-connected datapath. The program can control FUs individually to form paths that exploit both spatial and pipeline parallelism across independent and dependent concurrent computations. We present a proof-of-concept design, RSN-XNN, on the Versal VCK190. Evaluations show a 22x latency reduction for BERT compared to the state of the art, along with throughput improvements of 3.2x, 2.4x, 2.5x, and 2.8x for BERT, ViT, NCF, and MLP, respectively. RSN-XNN matches the latency of the T4 GPU at the same FP32 performance while using only 18% of the memory bandwidth. Compared to the A100 GPU in the same 7nm process node, it achieves 2.1x/4.5x better operating/dynamic energy efficiency in FP32.