The Transformer is an emerging neural network model built on the attention mechanism. It has been applied to a wide range of tasks and has achieved favorable accuracy compared to CNNs and RNNs. Although the attention mechanism is recognized as a general-purpose component, many Transformer models require significantly more parameters than CNN-based ones. To mitigate this computational cost, a hybrid approach has recently been proposed that uses ResNet as the backbone architecture and replaces some of its convolution layers with an MHSA (Multi-Head Self-Attention) mechanism. In this paper, we significantly reduce the parameter size of such models by using a Neural ODE (Ordinary Differential Equation) as the backbone architecture instead of ResNet. The proposed hybrid model reduces the parameter size by 94.6% compared to the CNN-based ones without degrading accuracy. We then deploy the proposed model on a modest-sized FPGA device for edge computing. To further reduce FPGA resource utilization, we quantize the model using QAT (Quantization-Aware Training) rather than PTQ (Post-Training Quantization) in order to suppress the accuracy loss. As a result, an extremely lightweight Transformer-based model can be implemented on resource-limited FPGAs. The weights of the feature extraction network are stored on-chip to minimize memory transfer overhead, so that inference runs seamlessly and faster. The proposed FPGA implementation achieves a 12.8x speedup and 9.21x higher energy efficiency compared to an ARM Cortex-A53 CPU.
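The parameter reduction claimed above comes from replacing a stack of residual blocks, each with its own weights, by a single block whose weights are reused at every integration step. The following is a minimal sketch of that idea only, not the authors' model: it assumes a toy fully connected residual function, Euler's method as the ODE solver, and hypothetical names and sizes (`dim`, `depth`, `make_weights`, `f`). The resulting reduction figure is illustrative and unrelated to the 94.6% reported in the abstract.

```python
import random

def make_weights(dim, rng):
    """Return a dim x dim weight matrix as nested lists (hypothetical toy layer)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

def f(h, w):
    """Toy residual function: linear map followed by ReLU."""
    return [max(0.0, sum(wij * hj for wij, hj in zip(row, h))) for row in w]

def resnet_forward(h, blocks):
    """ResNet style: every block owns its weights, h <- h + f(h, w_i)."""
    for w in blocks:
        h = [a + b for a, b in zip(h, f(h, w))]
    return h

def ode_forward(h, w, steps):
    """Neural ODE style solved with Euler's method: one shared w, h <- h + dt * f(h, w)."""
    dt = 1.0 / steps
    for _ in range(steps):
        h = [a + dt * b for a, b in zip(h, f(h, w))]
    return h

def count_params(mats):
    """Count scalar weights across a list of matrices."""
    return sum(len(row) for m in mats for row in m)

rng = random.Random(0)
dim, depth = 8, 10  # assumed toy sizes, not taken from the paper

resnet_blocks = [make_weights(dim, rng) for _ in range(depth)]
ode_weights = make_weights(dim, rng)

resnet_params = count_params(resnet_blocks)   # depth * dim * dim
ode_params = count_params([ode_weights])      # dim * dim, independent of depth
reduction = 1.0 - ode_params / resnet_params

x = [1.0] * dim
y_res = resnet_forward(x, resnet_blocks)
y_ode = ode_forward(x, ode_weights, steps=depth)

print(resnet_params, ode_params, f"{reduction:.0%}")  # prints: 640 64 90%
```

In this sketch the ODE-style block keeps the same number of weights no matter how many Euler steps are taken, which is why the saving grows with the depth of the ResNet stack it replaces.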