Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being deployed in many real-world applications. Weight pruning and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and the associated computational demands, while token pruning further reduces computation dynamically based on the input. Combining these two techniques should significantly reduce computational complexity and model size; however, naively integrating them results in irregular computation patterns, leading to significant accuracy drops and difficulties in hardware acceleration. To address these challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViTs on FPGA through simultaneous pruning, combining static weight pruning and dynamic token pruning. For algorithm design, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters with a dynamic token pruning method for removing unimportant token vectors. Moreover, we design a novel training algorithm to recover the model's accuracy. For hardware design, we develop a novel hardware accelerator for executing the pruned model. The proposed hardware design employs multi-level parallelism with a load-balancing strategy to efficiently handle the irregular computation patterns caused by the two pruning approaches. Moreover, we develop a hardware mechanism for efficiently executing on-the-fly token pruning.
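To make the two pruning styles concrete, the following is a minimal sketch in plain Python, not the method proposed in this work. It assumes that weight blocks are ranked by their squared L2 norm, that each token arrives with a precomputed importance score (for example, attention received from the class token), and that index 0 is the class token. The names `block_prune`, `prune_tokens`, and `keep_ratio` are hypothetical and chosen only for illustration.

```python
import math


def block_prune(weight, block, keep_ratio):
    """Static pruning: zero out whole block x block tiles with the smallest norm.

    Assumption: tiles are ranked by their sum of squared entries, which gives
    the same ordering as the L2 norm.
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % block == 0 and cols % block == 0
    scores = {}
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            scores[(br, bc)] = sum(
                weight[r][c] ** 2
                for r in range(br, br + block)
                for c in range(bc, bc + block)
            )
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    kept = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    # Entries outside the kept tiles become zero; kept tiles are unchanged.
    pruned = [
        [weight[r][c] if (r - r % block, c - c % block) in kept else 0.0
         for c in range(cols)]
        for r in range(rows)
    ]
    return pruned, kept


def prune_tokens(tokens, importance, keep_ratio):
    """Dynamic pruning: keep the class token plus the highest-scoring tokens.

    Assumption: index 0 is the class token and is never dropped.
    """
    n_keep = max(1, math.ceil(keep_ratio * (len(tokens) - 1)))
    ranked = sorted(range(1, len(tokens)), key=lambda i: importance[i], reverse=True)
    keep_idx = [0] + sorted(ranked[:n_keep])  # preserve the original token order
    return [tokens[i] for i in keep_idx], keep_idx


if __name__ == "__main__":
    W = [
        [1.0, 1.0, 0.1, 0.1],
        [1.0, 1.0, 0.1, 0.1],
        [0.2, 0.2, 2.0, 2.0],
        [0.2, 0.2, 2.0, 2.0],
    ]
    pruned, kept = block_prune(W, block=2, keep_ratio=0.5)
    print(sorted(kept))  # top-left coordinates of the surviving tiles

    kept_tokens, idx = prune_tokens(
        ["cls", "a", "b", "c", "d"], [0.0, 0.1, 0.4, 0.05, 0.3], keep_ratio=0.5
    )
    print(kept_tokens, idx)
```

Because the set of surviving tiles is fixed offline while the set of surviving tokens changes with every input, the amount of work per tile row and per input varies. This is the kind of irregular computation pattern that the load-balancing strategy in the hardware design is meant to absorb.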