Reconfigurable architectures such as Field Programmable Gate Arrays (FPGAs) have been used to accelerate computation in several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely adopted for high-performance computing, primarily because they are complex to program and difficult to optimize for performance. To gain insight into the use of FPGAs for high-performance computing, we optimize Tensil AI's open-source inference accelerator for maximum performance, using ResNet20 trained on CIFAR as the benchmark. We show that improving the hardware design, using Xilinx Ultra RAM, and applying advanced compiler strategies lead to improved inference performance. We also demonstrate on the CIFAR test data set that reducing precision from the original 32-bit floating point causes very little loss of accuracy. The heterogeneous computing model of our platform allows us to achieve a frame rate of 293.58 frames per second (FPS) and 90% accuracy on ResNet20 trained on CIFAR. Experimental results show that the proposed accelerator achieves a throughput of 21.12 giga-operations per second (GOP/s) with an on-chip power consumption of 5.21 W at 100 MHz. Comparison with off-the-shelf devices and recent state-of-the-art implementations shows that the proposed accelerator has clear advantages in energy efficiency.
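The headline figures in the abstract can be combined into a few derived metrics that make the energy-efficiency claim easier to compare across devices. The following is a minimal sketch, not the authors' evaluation code: the function name `derived_metrics` and its dictionary keys are hypothetical, and it assumes that the reported frame rate, throughput, and on-chip power all refer to the same operating point at 100 MHz.

```python
# Figures quoted in the abstract.
FPS = 293.58             # frames per second
THROUGHPUT_GOPS = 21.12  # giga-operations per second
POWER_W = 5.21           # on-chip power in watts
CLOCK_MHZ = 100.0        # accelerator clock in MHz


def derived_metrics(fps, gops, power_w, clock_mhz):
    """Combine the reported figures into derived metrics.

    Hypothetical helper for illustration; assumes all inputs were
    measured at the same operating point.
    """
    return {
        # Energy efficiency: operations per second per watt.
        "gops_per_watt": gops / power_w,
        # Work implied per inference, in mega-operations.
        "mops_per_frame": gops * 1e3 / fps,
        # Average time per frame in ms (ignores any pipelining).
        "ms_per_frame": 1e3 / fps,
        # On-chip energy per frame in millijoules.
        "mj_per_frame": power_w / fps * 1e3,
        # Average operations completed per clock cycle.
        "ops_per_cycle": gops * 1e3 / clock_mhz,
    }


if __name__ == "__main__":
    metrics = derived_metrics(FPS, THROUGHPUT_GOPS, POWER_W, CLOCK_MHZ)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the reported numbers correspond to roughly 4.05 GOP/s per watt and about 211 operations per clock cycle. The per-frame energy counts on-chip power only, so it excludes off-chip memory and board-level consumption.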