M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs

Mixed-precision quantization is a popular approach for compressing deep neural networks (DNNs). However, it is challenging to scale the performance efficiently with mixed-precision DNNs given the current FPGA architecture and conventional accelerator dataflows. In this work, we enhance the FPGA's capability for accelerating mixed-precision DNNs by proposing M4BRAM, a novel compute-in-block RAM (BRAM) architecture that can compute mixed-precision matrix-matrix multiplication. On the precision side, M4BRAM supports a wide range of mixed-precision DNN configurations: the weight precision can be 2, 4, or 8 bits, while the activation precision can vary from 2 to 8 bits. On the dataflow side, M4BRAM leverages a novel in-BRAM data duplication scheme to achieve high hardware utilization. Moreover, during M4BRAM computation, other FPGA resources can seamlessly access its data without the need for a separate buffer. Hence, unlike prior compute-in-BRAM proposals, M4BRAM can simultaneously perform mixed-precision computation and maintain full functionality as a memory unit to truly complement the existing compute resources on FPGAs. Experiments show that adding M4BRAM to a tiled DNN accelerator can achieve an average speedup of 2.16× across various DNNs on the ImageNet classification task while incurring a negligible accuracy loss of less than 0.5%. Compared to the same tiled accelerator that employs a prior compute-in-BRAM architecture, M4BRAM delivers 1.43× higher performance on average across various DNNs.
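To make the precision configurations above concrete, here is a minimal software sketch of a mixed-precision matrix-matrix multiplication in Python. It is only an illustration under stated assumptions, not the M4BRAM hardware datapath, which the abstract does not detail: the symmetric per-matrix quantizer and the function names (`quantize`, `int_matmul`, `mixed_precision_matmul`) are hypothetical choices made for this example. Only the allowed bit widths (2/4/8 for weights, 2 to 8 for activations) come from the abstract.

```python
# Illustrative sketch only: symmetric uniform quantization is an assumption,
# not necessarily the scheme used in the M4BRAM paper.

WEIGHT_BITS = (2, 4, 8)   # weight precisions named in the abstract
ACT_BITS = range(2, 9)    # activation precisions 2..8 named in the abstract


def quantize(matrix, bits):
    """Map a float matrix to signed `bits`-bit integers plus one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 1 for 2-bit, 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for row in matrix for v in row) or 1.0
    scale = peak / qmax
    q = [[max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
         for row in matrix]
    return q, scale


def int_matmul(a, b):
    """Plain integer matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mixed_precision_matmul(acts, weights, act_bits, weight_bits):
    """Quantize both operands at their own precision, multiply, then rescale."""
    if weight_bits not in WEIGHT_BITS:
        raise ValueError("weight precision must be 2, 4, or 8 bits")
    if act_bits not in ACT_BITS:
        raise ValueError("activation precision must be between 2 and 8 bits")
    q_acts, s_acts = quantize(acts, act_bits)
    q_weights, s_weights = quantize(weights, weight_bits)
    acc = int_matmul(q_acts, q_weights)  # integer-only accumulation
    return [[v * s_acts * s_weights for v in row] for row in acc]
```

For example, `mixed_precision_matmul(acts, weights, act_bits=6, weight_bits=4)` multiplies 6-bit activations by 4-bit weights. All multiply-accumulate work happens on small integers, and the two scale factors are applied once at the end, which is the general property that makes low-precision hardware attractive for this workload.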

