Vision transformers (ViTs) continue to set new milestones in computer vision, but their sophisticated architectures incur high computation and memory costs that impede deployment on resource-limited edge devices. In this paper, we propose HeatViT, a hardware-efficient image-adaptive token pruning framework for efficient yet accurate ViT acceleration on embedded FPGAs. By analyzing the inherent computational patterns in ViTs, we first design an effective attention-based multi-head token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Moreover, we implement the token selector in hardware by adding miniature control logic that heavily reuses the existing hardware components built for the backbone ViT. To improve hardware efficiency, we further employ 8-bit fixed-point quantization and propose polynomial approximations, with a regularization effect on quantization error, for the nonlinear functions frequently used in ViTs. Finally, we propose a latency-aware multi-stage training strategy that determines which transformer blocks receive token selectors and optimizes the desired (average) pruning rates of the inserted selectors, improving both model accuracy and inference latency on hardware. Compared to existing ViT pruning studies, HeatViT achieves 0.7%$\sim$8.9% higher accuracy at a similar computation cost, and more than 28.4%$\sim$65.3% computation reduction at a similar model accuracy, for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve 3.46$\times$$\sim$4.89$\times$ speedup.
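To make the idea of identifying and consolidating non-informative tokens concrete, the following is a minimal sketch in plain Python, not the HeatViT implementation. The function name `prune_and_consolidate`, the `keep_ratio` parameter, and the use of an unweighted mean for the merged token are all illustrative assumptions; in the framework described above, the per-token scores would come from the learned attention-based selector rather than being supplied by hand.

```python
import math
from typing import List


def prune_and_consolidate(
    tokens: List[List[float]],
    scores: List[float],
    keep_ratio: float,
) -> List[List[float]]:
    """Keep the highest-scoring tokens and merge the rest into one token.

    Hypothetical helper: `scores` stands in for the output of a token
    selector, and the merged token is a plain mean (an assumption).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []

    n = len(tokens)
    n_keep = min(n, max(1, math.ceil(n * keep_ratio)))

    # Indices of the n_keep highest-scoring tokens, restored to input order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:n_keep])
    pruned_idx = sorted(ranked[n_keep:])

    kept = [list(tokens[i]) for i in kept_idx]
    if not pruned_idx:
        return kept

    # Consolidate all pruned tokens into a single averaged token so that
    # their information is summarized instead of discarded outright.
    dim = len(tokens[0])
    merged = [
        sum(tokens[i][d] for i in pruned_idx) / len(pruned_idx)
        for d in range(dim)
    ]
    return kept + [merged]


example_tokens = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 6.0]]
example_scores = [0.9, 0.1, 0.8, 0.3]
result = prune_and_consolidate(example_tokens, example_scores, keep_ratio=0.5)
print(result)
```

With four input tokens and `keep_ratio=0.5`, the two highest-scoring tokens are kept and the other two are averaged into one, so later blocks would process three tokens instead of four. A real ViT pipeline would also typically exempt the class token from pruning, which this sketch omits.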