Mixed-precision quantization is a popular approach for compressing deep neural networks (DNNs). However, it is challenging to scale performance efficiently for mixed-precision DNNs given the current FPGA architecture and conventional accelerator dataflows. In this work, we enhance the FPGA's capability for accelerating mixed-precision DNNs by proposing M4BRAM, a novel compute-in-block RAM (BRAM) architecture that can compute mixed-precision matrix-matrix multiplication. On the precision side, M4BRAM supports a wide range of mixed-precision DNN configurations -- the weight precision can be 2/4/8 bits while the activation precision can vary from 2 to 8 bits. On the dataflow side, M4BRAM leverages a novel in-BRAM data duplication scheme to achieve high hardware utilization. Moreover, during M4BRAM computation, other FPGA resources can seamlessly access its data without the need for a separate buffer. Hence, unlike prior compute-in-BRAM proposals, M4BRAM can simultaneously perform mixed-precision computation and maintain full functionality as a memory unit to \textit{truly} complement the existing compute resources on FPGAs. Experiments show that adding M4BRAM to a tiled DNN accelerator achieves an average speedup of 2.16$\times$ across various DNNs on the ImageNet classification task while incurring a negligible accuracy loss of $<$ 0.5%. Compared to the same tiled accelerator employing a prior compute-in-BRAM architecture, M4BRAM delivers 1.43$\times$ higher performance on average across various DNNs.
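To make the precision configurations in the abstract concrete, the following is a minimal software sketch of a mixed-precision integer matrix multiplication with 2/4/8-bit weights and 2- to 8-bit activations. It is not the M4BRAM hardware design: the abstract does not specify the arithmetic scheme, so the symmetric uniform quantizers and the bit-serial accumulation over activation bit planes are assumptions chosen for illustration, and all function names (`quantize_signed`, `quantize_unsigned`, `int_matmul`, `bit_serial_matmul`) are hypothetical.

```python
def quantize_signed(values, bits):
    """Symmetric uniform quantization of a matrix to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for row in values for v in row) or 1.0
    scale = peak / qmax
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in values]
    return q, scale


def quantize_unsigned(values, bits):
    """Uniform quantization of non-negative values, e.g. post-ReLU activations (assumed)."""
    qmax = 2 ** bits - 1
    peak = max(v for row in values for v in row) or 1.0
    scale = peak / qmax
    q = [[max(0, min(qmax, round(v / scale))) for v in row] for row in values]
    return q, scale


def int_matmul(a, b):
    """Reference integer matrix-matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def bit_serial_matmul(acts, weights, act_bits):
    """Compute acts @ weights one activation bit plane at a time.

    Each unsigned activation is split into act_bits single-bit planes; a set
    bit at position `bit` contributes the weight shifted left by `bit`.
    Changing act_bits only changes the number of passes, which is one way
    variable activation precision can be supported.
    """
    rows, inner, cols = len(acts), len(weights), len(weights[0])
    out = [[0] * cols for _ in range(rows)]
    for bit in range(act_bits):
        for i in range(rows):
            for k in range(inner):
                if (acts[i][k] >> bit) & 1:
                    for j in range(cols):
                        out[i][j] += weights[k][j] << bit
    return out


if __name__ == "__main__":
    acts = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.1]]
    weights = [[0.3, -0.7], [-0.2, 0.9], [0.5, -0.1]]
    q_w, w_scale = quantize_signed(weights, bits=4)    # weights: 2, 4, or 8 bits
    q_a, a_scale = quantize_unsigned(acts, bits=5)     # activations: 2 to 8 bits
    acc = bit_serial_matmul(q_a, q_w, act_bits=5)
    # Rescale the integer accumulators back to approximate real values.
    print([[v * w_scale * a_scale for v in row] for row in acc])
```

In this sketch the bit-serial result matches the plain integer product exactly for every supported precision pair; only the number of bit-plane passes varies with the activation width.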