Design and Optimization of Residual Neural Network Accelerators for Low-Power FPGAs Using High-Level Synthesis

Residual neural networks (ResNets) are widely used in computer vision tasks. They enable the construction of deeper and more accurate models by mitigating the vanishing gradient problem. Their main innovation is the residual block, which allows the output of one layer to bypass one or more intermediate layers and be added to the output of a later layer. Their complex structure and the buffering required by the residual block make them difficult to implement on resource-constrained platforms. We present a novel design flow for implementing deep learning models on field-programmable gate arrays (FPGAs), optimized for ResNets, that uses a strategy to reduce buffering overhead and thereby obtains a resource-efficient implementation of the residual layer. Our high-level synthesis (HLS)-based flow encompasses a thorough set of design principles and optimization strategies, exploiting standard techniques such as temporal reuse and loop merging in novel ways to efficiently map ResNet models, and potentially other skip-connection-based neural network architectures, onto FPGAs. The models are quantized to 8-bit integers for both weights and activations, 16-bit for biases, and 32-bit for accumulations. The experimental results are obtained on the CIFAR-10 dataset using ResNet8 and ResNet20, implemented with HLS on Xilinx FPGAs on the Ultra96-V2 and Kria KV260 boards. Compared to the state of the art on the Kria KV260 board, our ResNet20 implementation achieves a 2.88X speedup with 0.5% higher accuracy (91.3%), while ResNet8 accuracy improves by 2.8% to 88.7%. The throughputs of ResNet8 and ResNet20 are 12971 FPS and 3254 FPS on the Ultra96-V2 board, and 30153 FPS and 7601 FPS on the Kria KV260, respectively. They Pareto-dominate state-of-the-art solutions with respect to accuracy, throughput, and energy.
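To make the quantization scheme and the residual addition concrete, the following is a minimal Python sketch, not the authors' HLS implementation. It models 8-bit weights and activations, a 16-bit bias, and a 32-bit accumulator, followed by a skip-connection add. The function names, the power-of-two `shift` used for rescaling, the saturating arithmetic, and the assumption that the bias is already expressed in the accumulator's scale are all illustrative choices that the abstract does not specify.

```python
def saturate(value, bits):
    """Clamp an integer to the signed two's-complement range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def quantized_dot(activations, weights, bias, shift):
    """Multiply int8 activations by int8 weights and accumulate in int32.

    `bias` is treated as a 16-bit value already in the accumulator's scale,
    and `shift` is a hypothetical power-of-two rescaling factor (both are
    assumptions made for this sketch).
    """
    acc = saturate(bias, 16)  # start the 32-bit accumulator from the bias
    for a, w in zip(activations, weights):
        product = saturate(a, 8) * saturate(w, 8)
        acc = saturate(acc + product, 32)
    # Rescale with an arithmetic right shift, then narrow back to 8 bits.
    return saturate(acc >> shift, 8)


def residual_add(main_path, skip_path):
    """Element-wise skip-connection add of two int8 vectors, with saturation."""
    return [saturate(m + s, 8) for m, s in zip(main_path, skip_path)]


if __name__ == "__main__":
    out = quantized_dot([10, -20, 30], [3, 4, 5], bias=100, shift=4)
    print(out)
    print(residual_add([100, -100, 5], [100, -100, -3]))
```

The residual add needs both operands at the same time, which is why the skip-path values must be held somewhere until the main path produces its result; that storage is the buffering overhead the design flow aims to reduce.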
