Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, where the Feed-Forward Network (FFN) tends to be the dominant computational bottleneck. This paper presents a low-power Vision Transformer accelerator optimized through algorithm-hardware co-design. Model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. Sparsity is further improved by replacing GELU with ReLU activations and employing dynamic FFN2 pruning, achieving a 61.5\% reduction in operations and a 59.3\% reduction in FFN2 weights with an accuracy loss of less than 2\%. The hardware adopts a row-wise dataflow with output-oriented data access to eliminate data transposition, and supports dynamic operations with minimal area overhead. Implemented in TSMC 28nm CMOS technology, the design occupies 496.4K gates and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at 1GHz, an energy efficiency of 2.31 TOPS/W, and an area efficiency of 858.61 GOPS/mm$^2$.
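The sparsity claim rests on ReLU producing exact zeros at the FFN1 output, so the FFN2 weight rows corresponding to zero activations can be skipped per token. The following is a minimal NumPy sketch of that idea only, not the paper's hardware dataflow or actual implementation; all function and variable names here are illustrative.

```python
# Minimal sketch (assumed, illustrative): ReLU yields exact zeros,
# so FFN2 rows fed by zero activations can be skipped per token.
import numpy as np

def ffn_dense(x, w1, b1, w2, b2):
    """Baseline FFN: ReLU(x @ W1 + b1) @ W2 + b2, computed densely."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU produces exact zeros
    return h @ w2 + b2

def ffn_pruned(x, w1, b1, w2, b2):
    """Same result, but FFN2 weight rows whose input activation is
    zero are skipped per token (dynamic FFN2 pruning, as a sketch)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    out = np.tile(b2, (x.shape[0], 1))
    for t in range(h.shape[0]):            # per token
        nz = np.nonzero(h[t])[0]           # active hidden channels only
        out[t] += h[t, nz] @ w2[nz]        # skip zero-activation rows
    return out

rng = np.random.default_rng(0)
tokens, d, hidden = 4, 8, 32
x  = rng.standard_normal((tokens, d))
w1 = rng.standard_normal((d, hidden)); b1 = rng.standard_normal(hidden)
w2 = rng.standard_normal((hidden, d)); b2 = rng.standard_normal(d)

# Skipping zero-activation rows changes nothing numerically.
assert np.allclose(ffn_dense(x, w1, b1, w2, b2),
                   ffn_pruned(x, w1, b1, w2, b2))
```

In hardware, the per-token gather over nonzero channels would map to skipping weight fetches and MAC cycles rather than an index loop, which is where the reported operation and weight reductions come from.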