Despite the scalable performance of vision transformers (ViTs), their dense computational costs (training and inference) undermine their position in industrial applications. Post-training quantization (PTQ), which tunes ViTs with a tiny dataset and runs them in a low-bit format, well addresses the cost issue but unfortunately suffers larger performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) Quantization inefficiency of the prevalent log2 quantizer for post-Softmax activations; (2) A rugged and magnified loss landscape under the coarse-grained quantization granularity used for post-LayerNorm activations. I&S-ViT then addresses these issues by introducing: (1) A novel shift-uniform-log2 quantizer (SULQ) that applies a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and an accurate distribution approximation; (2) A three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT's superiority over existing PTQ methods for ViTs, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
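To make the SULQ idea concrete, the following is a minimal PyTorch sketch of a shift-then-log2-then-uniform quantizer for post-Softmax activations. The function names, the affine scale calibration, and the handling of the shift value are illustrative assumptions; the paper's exact formulation (e.g., how the shift is chosen and optimized) may differ.

```python
import torch

def sulq_quantize(x: torch.Tensor, shift: float, bits: int = 4):
    """Hypothetical SULQ sketch: shift, map to the log2 domain, then
    quantize uniformly. `x` holds post-Softmax activations in [0, 1]."""
    qmax = 2 ** bits - 1
    # Shift so the full input domain, including values near 0, maps to a
    # finite range in the log2 domain.
    x_log = -torch.log2(x + shift)
    # Uniform (affine) quantization in the log2 domain.
    lo, hi = x_log.min(), x_log.max()
    scale = (hi - lo) / qmax
    q = torch.clamp(torch.round((x_log - lo) / scale), 0, qmax)
    return q, scale, lo

def sulq_dequantize(q: torch.Tensor, scale: torch.Tensor,
                    lo: torch.Tensor, shift: float) -> torch.Tensor:
    # Invert the uniform step, leave the log2 domain, then undo the shift.
    return torch.pow(2.0, -(q * scale + lo)) - shift
```

Intuitively, a plain log2 quantizer is the degenerate case of zero shift with power-of-two levels; the shift combined with a uniform step in the log domain is what lets the quantization levels cover the activation domain more inclusively while still approximating the heavy-tailed post-Softmax distribution.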
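The granularity contrast that SOS exploits can also be sketched. The helpers below (hypothetical names, symmetric fake quantization, channels assumed on the last dimension) merely illustrate the difference between one scale per layer and one scale per channel for post-LayerNorm activations; the paper's three-stage schedule for combining the two is not reproduced here.

```python
import torch

def fake_quant(x: torch.Tensor, scale: torch.Tensor, bits: int = 4):
    # Symmetric uniform fake quantization: quantize, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def layer_wise_scale(x: torch.Tensor, bits: int = 4):
    # One scale for the entire tensor: cheap at inference time, but severe
    # inter-channel variation in post-LayerNorm activations makes the loss
    # landscape rugged at low bit-widths under this coarse granularity.
    return x.abs().max() / (2 ** (bits - 1) - 1)

def channel_wise_scale(x: torch.Tensor, bits: int = 4):
    # One scale per channel (last dim): tracks per-channel ranges, giving
    # the smoother landscape in which stable learning is possible.
    reduce_dims = tuple(range(x.dim() - 1))
    return x.abs().amax(dim=reduce_dims, keepdim=True) / (2 ** (bits - 1) - 1)
```

In this reading, SOS optimizes in the friendlier channel-wise regime and retains the deployment convenience of layer-wise quantization, which is the amalgamation of strengths the abstract refers to.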