Transformer-based models are widely used in both computer vision (CV) and natural language processing (NLP). However, post-training linear quantization of these models often causes a noticeable drop in inference accuracy. Our study identifies the underlying causes of these accuracy drops and proposes a quantization-friendly fine-tuning method, \textbf{QuantTune}. First, our analysis shows that, on average, 65\% of quantization errors across the target Transformer-based models result from precision loss caused by outliers amplifying the dynamic range. Second, \textbf{QuantTune} adjusts weights based on the deviation of outlier activations and effectively constrains the dynamic ranges of the problematic activations, which mitigates the negative impact of outliers on the inference accuracy of quantized models. Finally, \textbf{QuantTune} integrates into the back-propagation pass of fine-tuning and adds no complexity to inference software or hardware design. Our approach significantly improves post-training quantization across a range of Transformer-based models, including ViT, BERT-base, and OPT. \textbf{QuantTune} reduces accuracy drops by 12.09\% at 8-bit quantization and 33.8\% at 7-bit quantization compared to top calibration methods, outperforming state-of-the-art solutions by over 18.84\% across ViT models.
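The dynamic range amplification effect described in the abstract can be illustrated with a small, self-contained sketch. It is not the QuantTune method itself. It assumes a symmetric per-tensor linear quantizer, toy Gaussian activations, and a single hypothetical outlier of magnitude 60. The helper names `quantize_dequantize` and `inlier_error` are made up for this illustration. The point is only that one outlier stretches the quantization step size, so the remaining values are represented more coarsely.

```python
import random


def quantize_dequantize(values, num_bits=8):
    """Symmetric per-tensor linear quantization, then dequantization.

    Returns the reconstructed values and the scale (step size) that was used.
    """
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    reconstructed = []
    for v in values:
        q = round(v / scale)              # map to the integer grid
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        reconstructed.append(q * scale)   # map back to floating point
    return reconstructed, scale


def inlier_error(inliers, extra, num_bits=8):
    """Mean absolute quantization error measured on the inliers only."""
    reconstructed, _ = quantize_dequantize(inliers + extra, num_bits)
    # zip stops at len(inliers), so the extra values are excluded from the error.
    total = sum(abs(a - b) for a, b in zip(inliers, reconstructed))
    return total / len(inliers)


rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # assumed toy data

err_clean = inlier_error(activations, [])
err_outlier = inlier_error(activations, [60.0])  # one hypothetical outlier

print(f"mean abs error without outlier: {err_clean:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

Running the script prints the two errors side by side. Lowering `num_bits` to 7 shrinks the integer grid and should widen the gap further, which is consistent with the larger 7-bit improvement reported in the abstract.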