Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic

Recent advances in neural network quantisation have yielded remarkable outcomes, with three-bit networks reaching state-of-the-art full-precision accuracy in complex tasks. These achievements present valuable opportunities for accelerating neural networks by computing in reduced precision. Implementing reduced-precision inference on FPGAs can take advantage of bit-level reconfigurability, which is not available on conventional CPUs and GPUs. Simultaneously, the high data intensity of neural network processing has inspired computing-in-memory paradigms, including on FPGA platforms. By programming the effects of trained model weights as lookup operations in soft logic, the transfer of weight data from memory units can be avoided, alleviating the memory bottleneck. However, previous methods scale poorly: their high logic utilisation limits them to small networks or sub-networks of binary models with low accuracy. In this paper, we introduce Table Lookup Multiply-Accumulate (TLMAC) as a framework to compile and optimise quantised neural networks for scalable lookup-based processing. TLMAC clusters and maps unique groups of weights to lookup-based processing elements, enabling highly parallel computation while taking advantage of parameter redundancy. Place and route algorithms are further proposed to reduce LUT utilisation and routing congestion. We demonstrate that TLMAC significantly improves on the scalability of previous related works. Our efficient logic mapping and high degree of reuse enable entire ImageNet-scale quantised models with full-precision accuracy to be implemented using lookup-based computing on one commercially available FPGA.
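The general idea of replacing multiply-accumulate with table lookups can be illustrated in software. The following is a minimal sketch, not the TLMAC compiler or its hardware mapping: it assumes signed integer weights, unsigned activations processed one bit plane at a time, and a group size chosen purely for illustration. The names `build_table`, `lookup_mac` and `build_unique_tables` are hypothetical.

```python
def build_table(weights):
    """Precompute the dot product of a fixed weight group with every
    possible vector of 1-bit activations (2**G entries for G weights).
    This stands in for the contents of a lookup-based processing element."""
    g = len(weights)
    table = []
    for index in range(2 ** g):
        # Bit i of `index` is the 1-bit activation paired with weights[i].
        table.append(sum(w for i, w in enumerate(weights) if (index >> i) & 1))
    return table


def lookup_mac(table, activations, act_bits):
    """Dot product of the table's weight group with unsigned activations,
    using one table lookup per activation bit plane (assumed bit-serial)."""
    total = 0
    for b in range(act_bits):
        # Gather bit b of every activation into a table index.
        index = 0
        for i, a in enumerate(activations):
            index |= ((a >> b) & 1) << i
        # Weight the partial sum by the significance of this bit plane.
        total += table[index] << b
    return total


def build_unique_tables(weight_groups):
    """Build one table per distinct weight group, so that repeated groups
    share a table. This mimics exploiting parameter redundancy only in
    spirit; it is not the clustering method of the paper."""
    tables = {}
    for group in weight_groups:
        key = tuple(group)
        if key not in tables:
            tables[key] = build_table(key)
    return tables


if __name__ == "__main__":
    weights = (3, -2, 1, -4)      # hypothetical 4-bit signed weight group
    activations = [5, 0, 7, 2]    # hypothetical 3-bit unsigned activations
    table = build_table(weights)
    print(lookup_mac(table, activations, act_bits=3))
    print(sum(w * a for w, a in zip(weights, activations)))
```

Both printed values are the same dot product, computed once through lookups and once directly. The weights never appear at run time: they are baked into the table contents, which is the property that removes weight transfers from memory. The table grows exponentially with the group size, which is one reason logic utilisation limits how far such schemes scale.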

