Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread deployment. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers and employ optimized attention computation with linear complexity. Post-training quantization has also been proposed as a means of mitigating computational demands. On mobile devices, achieving optimal acceleration for ViTs requires the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior work has applied quantization to efficient hybrid transformers. In this paper, we show that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributable to the four following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters ($<$5M). To overcome these challenges, we propose a new post-training quantization method, the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). Compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT), we achieve significant average improvements of 17.73% for 8-bit and 29.75% for 6-bit quantization. We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
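To make challenge (ii) concrete, the following is a minimal sketch (not taken from the paper) of how zero-point overflow can arise in standard asymmetric (affine) quantization: when an activation range is narrow and one-sided, the computed zero-point falls outside the representable unsigned 8-bit grid and must be clipped, shifting every dequantized value. The helper name and the example ranges are illustrative assumptions.

```python
def asymmetric_qparams(x_min, x_max, n_bits=8):
    """Affine (asymmetric) quantization parameters for an unsigned n-bit grid."""
    qmax = 2 ** n_bits - 1
    scale = (x_max - x_min) / qmax
    zero_point = round(-x_min / scale)  # should land in [0, qmax]
    return scale, zero_point

# A typical two-sided activation range: the zero-point lands inside [0, 255].
_, zp_ok = asymmetric_qparams(-1.0, 3.0)   # zp_ok = 64

# A narrow one-sided range far from zero (hypothetical values): the computed
# zero-point falls far outside [0, 255], so it must be clipped, which shifts
# every dequantized value and degrades accuracy.
_, zp_bad = asymmetric_qparams(5.0, 6.0)   # zp_bad = -1275
```

Clipping `zp_bad` back into `[0, 255]` makes the real value 0.0 unrepresentable on the quantized grid, which is one way such highly dynamic, one-sided ranges break off-the-shelf PTQ.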