Since its introduction, Swin Transformer has achieved remarkable results in computer vision, sparking the need for dedicated hardware accelerators, particularly for edge computing. Owing to their flexibility and low power consumption, FPGAs have been widely employed to accelerate the inference of convolutional neural networks (CNNs) and show potential for Transformer-based models. Unlike CNNs, which mainly involve multiply-and-accumulate (MAC) operations, Transformers also involve nonlinear computations such as Layer Normalization (LN), Softmax, and GELU. These nonlinear computations pose challenges for accelerator design. In this paper, we propose an efficient FPGA-based hardware accelerator for Swin Transformer, focusing on strategies for handling these nonlinear computations and on processing MAC operations efficiently. We replace LN with Batch Normalization (BN), because BN can be fused with linear layers during inference and therefore adds no separate normalization step at inference time. The modified Swin-T, Swin-S, and Swin-B achieve Top-1 accuracies of 80.7%, 82.7%, and 82.8% on ImageNet, respectively. Furthermore, we employ approximate computation strategies to design hardware-friendly architectures for Softmax and GELU. We also design an efficient Matrix Multiplication Unit that handles all linear computations in Swin Transformer. As a result, compared with a CPU (AMD Ryzen 5700X), our accelerator achieves 1.76x, 1.66x, and 1.25x speedups and 20.45x, 18.60x, and 14.63x improvements in energy efficiency (FPS per unit of power consumption) on Swin-T, Swin-S, and Swin-B, respectively. Compared with a GPU (Nvidia RTX 2080 Ti), it achieves 5.05x, 4.42x, and 3.00x improvements in energy efficiency on the same three models, respectively. To the best of our knowledge, the proposed accelerator is the fastest FPGA-based accelerator for Swin Transformer.
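The fusion of BN with a linear layer mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes that BN directly precedes the linear layer and that inference uses fixed running statistics; the abstract does not state the exact placement or the hardware data format, so the function names, shapes, and `eps` value below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def fuse_bn_into_linear(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = Linear(BN(x)) into a single linear layer (hypothetical helper).

    weight: (out_features, in_features), bias: (out_features,)
    gamma, beta, mean, var: BN parameters and running statistics, (in_features,)
    """
    # BN at inference is a per-channel affine map: BN(x) = scale * x + shift
    scale = gamma / np.sqrt(var + eps)
    shift = beta - scale * mean
    # W (scale * x + shift) + b = (W * scale) x + (W shift + b)
    fused_weight = weight * scale          # scales column j of W by scale[j]
    fused_bias = bias + weight @ shift
    return fused_weight, fused_bias


def bn_then_linear(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: apply BN with running statistics, then the linear layer."""
    x_hat = (x - mean) / np.sqrt(var + eps) * gamma + beta
    return x_hat @ weight.T + bias


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_f, out_f = 16, 8
    x = rng.normal(size=(4, in_f))
    w, b = rng.normal(size=(out_f, in_f)), rng.normal(size=out_f)
    gamma, beta = rng.normal(size=in_f), rng.normal(size=in_f)
    mean, var = rng.normal(size=in_f), rng.uniform(0.5, 2.0, size=in_f)

    fw, fb = fuse_bn_into_linear(w, b, gamma, beta, mean, var)
    reference = bn_then_linear(x, w, b, gamma, beta, mean, var)
    print(np.allclose(x @ fw.T + fb, reference))
```

Because BN at inference reduces to a fixed per-channel scale and shift, the fused layer costs the same as the original linear layer. LN cannot be folded this way, since its mean and variance are computed from each input token at run time.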