Residual neural networks (ResNets) are widely used in computer vision tasks. They enable the construction of deeper and more accurate models by mitigating the vanishing gradient problem. Their main innovation is the residual block, which allows the output of one layer to bypass one or more intermediate layers and be added to the output of a later layer. Their complex structure and the buffering required by the residual block make them difficult to implement on resource-constrained platforms. We present a novel design flow for implementing deep learning models on field-programmable gate arrays (FPGAs). The flow is optimized for ResNets and uses a strategy that reduces their buffering overhead to obtain a resource-efficient implementation of the residual layer. Our high-level synthesis (HLS)-based flow encompasses a thorough set of design principles and optimization strategies, exploiting standard techniques such as temporal reuse and loop merging in novel ways to efficiently map ResNet models, and potentially other skip-connection-based neural network (NN) architectures, onto FPGAs. The models are quantized to 8-bit integers for both weights and activations, 16-bit integers for biases, and 32-bit integers for accumulations. The experimental results are obtained on the CIFAR-10 dataset using ResNet8 and ResNet20, implemented with HLS on the Xilinx FPGAs of the Ultra96-V2 and Kria KV260 boards. Compared to the state of the art on the Kria KV260 board, our ResNet20 implementation achieves a 2.88X speedup with 0.5% higher accuracy, reaching 91.3%, while ResNet8 accuracy improves by 2.8% to 88.7%. The throughputs of ResNet8 and ResNet20 are 12971 FPS and 3254 FPS on the Ultra96-V2 board, and 30153 FPS and 7601 FPS on the Kria KV260 board, respectively. They Pareto-dominate state-of-the-art solutions in terms of accuracy, throughput, and energy.
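To make the quantization scheme and the residual addition concrete, the following is a minimal C++ sketch, not the paper's actual HLS code. It uses the bit widths stated in the abstract (8-bit weights and activations, 16-bit biases, 32-bit accumulators). The function names, the power-of-two rescaling by `shift`, and the choice to align the skip value to the accumulator scale before the ReLU are all illustrative assumptions that the abstract does not confirm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Saturate a 32-bit value to the signed 8-bit activation range.
inline std::int8_t saturate_int8(std::int32_t v) {
    const std::int32_t lo = -128;
    const std::int32_t hi = 127;
    return static_cast<std::int8_t>(std::min(std::max(v, lo), hi));
}

// One output of a quantized dot product (hypothetical helper):
// int8 x int8 products are summed into an int32 accumulator that is
// seeded with the int16 bias.
inline std::int32_t mac_int8(const std::int8_t* act, const std::int8_t* wgt,
                             std::size_t n, std::int16_t bias) {
    std::int32_t acc = bias;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(act[i]) *
               static_cast<std::int32_t>(wgt[i]);
    }
    return acc;
}

// Residual merge (hypothetical helper): add the skip activation to the
// convolution accumulator, apply ReLU, then rescale and saturate to int8.
// ASSUMPTION: both branches share a power-of-two scale given by `shift`.
inline std::int8_t residual_add_relu(std::int32_t acc, std::int8_t skip,
                                     int shift) {
    assert(shift >= 0 && shift <= 16);
    // Bring the skip value up to the accumulator scale by multiplying.
    const std::int32_t aligned_skip =
        static_cast<std::int32_t>(skip) * (std::int32_t{1} << shift);
    const std::int32_t sum = acc + aligned_skip;
    // ReLU first, so the right shift below only sees non-negative values.
    const std::int32_t relu = std::max(sum, std::int32_t{0});
    return saturate_int8(relu >> shift);
}
```

In a streaming FPGA design, the `skip` operand is the value that has to be buffered while the intermediate layers compute `acc`, which is the buffering overhead the abstract refers to. An actual HLS implementation would more likely use arbitrary-precision integer types and pipelining directives than the fixed-width standard types used here.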