Motivated by the huge success of Transformers in natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and have achieved remarkable performance on various computer vision tasks. However, their large model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods such as quantization. Unfortunately, due to hardware-unfriendly and quantization-sensitive non-linear operations, particularly {Softmax}, it is non-trivial to fully quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to these challenges in \textit{standard ViTs}, we turn our attention to the quantization and acceleration of \textit{efficient ViTs}, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and we propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a {tailored post-training quantization engine} that takes full account of the unique activation distributions of Softmax-free efficient ViTs, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently demonstrate the effectiveness of our Trio-ViT framework. {Particularly, we gain up to $\uparrow$$\mathbf{3.6}\times$, $\uparrow$$\mathbf{5.0}\times$, and $\uparrow$$\mathbf{7.3}\times$ FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as $\uparrow$$\mathbf{6.0}\times$, $\uparrow$$\mathbf{1.5}\times$, and $\uparrow$$\mathbf{2.1}\times$ DSP efficiency.} Codes are available at \url{https://github.com/shihuihong214/Trio-ViT}.