Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising method for lowering implementation cost and energy consumption. However, aggressive quantization below 2 bits causes considerable accuracy degradation due to unstable convergence, especially when downstream data is scarce. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast-converging QAT of ultra-low-precision pre-trained Transformers. TI intervenes in layer-wise signal propagation by substituting the intact signal from the teacher, which removes the interference of propagated quantization errors, smooths the loss surface of QAT, and expedites convergence. Furthermore, we propose a gradual intervention mechanism to stabilize the recovery of subsections of Transformer layers from quantization. The proposed schemes enable fast convergence of QAT and improve model accuracy regardless of the diverse characteristics of downstream fine-tuning tasks. We demonstrate that, compared to state-of-the-art QAT methods, TI consistently achieves superior accuracy with significantly fewer fine-tuning iterations on well-known Transformers for natural language processing as well as computer vision.
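The intervention idea can be illustrated with a toy sketch. It is not the implementation from this work: the tanh layer stack, the `ternarize` quantizer, and the helper names `forward` and `layer_errors` are all assumptions chosen for illustration. The sketch compares a quantized student that propagates its own activations with one whose every layer reads the teacher's intact activation instead, so each layer sees only its own quantization error.

```python
import numpy as np


def ternarize(w):
    """Toy ternary quantizer mapping weights to {-a, 0, +a}.

    Hypothetical stand-in for an ultra-low-precision quantizer.
    """
    threshold = 0.7 * np.mean(np.abs(w))
    mask = np.abs(w) > threshold
    scale = np.abs(w[mask]).mean() if mask.any() else 0.0
    return np.where(mask, scale * np.sign(w), 0.0)


def forward(weights, x, teacher_inputs=None):
    """Run a stack of tanh layers and return the output of every layer.

    When teacher_inputs is given, layer l reads teacher_inputs[l] instead of
    the previous layer's own output, so earlier errors are not propagated.
    """
    outputs = []
    h = x
    for l, w in enumerate(weights):
        layer_in = h if teacher_inputs is None else teacher_inputs[l]
        h = np.tanh(layer_in @ w)
        outputs.append(h)
    return outputs


rng = np.random.default_rng(0)
dim, depth = 16, 6
teacher_w = [rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, dim)) for _ in range(depth)]
student_w = [ternarize(w) for w in teacher_w]
x = rng.normal(size=(8, dim))

# Full-precision teacher pass; its activations are the "intact signal".
teacher_out = forward(teacher_w, x)
teacher_in = [x] + teacher_out[:-1]

# Student without and with intervention.
plain = forward(student_w, x)
intervened = forward(student_w, x, teacher_inputs=teacher_in)


def layer_errors(student_out):
    """Mean squared error against the teacher, one value per layer."""
    return [float(np.mean((s - t) ** 2)) for s, t in zip(student_out, teacher_out)]


err_plain = layer_errors(plain)
err_ti = layer_errors(intervened)
for l, (a, b) in enumerate(zip(err_plain, err_ti)):
    print(f"layer {l}: propagated error {a:.4f} | with intervention {b:.4f}")
```

The first layer reports the same error in both modes because it receives the same input. Any difference at deeper layers comes from errors carried forward from earlier quantized layers, which the intervention removes from each layer's input.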