Generative Language Models (GLMs) have shown impressive performance in tasks such as text generation, understanding, and reasoning. However, their large model size poses challenges for practical deployment. Quantization-Aware Training (QAT) has become an increasingly popular way to address this problem, but current QAT methods for generative models incur a noticeable loss of accuracy. To counteract this issue, we propose a novel knowledge distillation method specifically designed for GLMs. Our method, called token-scaled logit distillation, prevents overfitting and provides superior learning from both the teacher model and the ground truth. This research marks the first evaluation of ternary weight quantization-aware training of large-scale GLMs with a perplexity degradation of less than 1.0 and no loss of accuracy in a reasoning task.
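For readers unfamiliar with ternary weights, the following is a minimal sketch of what ternarizing a weight matrix can look like. It is not the method of this work: it uses a common threshold heuristic from the ternary-weight literature (threshold at 0.7 times the mean absolute weight, with one scale per tensor), and the function name `ternarize` and the parameter `threshold_ratio` are hypothetical names chosen for illustration. In QAT, a quantizer of this kind would typically be applied in the forward pass while full-precision weights are kept for the gradient update.

```python
import numpy as np

def ternarize(weights: np.ndarray, threshold_ratio: float = 0.7):
    """Map a weight tensor to codes in {-1, 0, +1} plus one scale (illustrative only)."""
    abs_w = np.abs(weights)
    # Assumed heuristic: weights whose magnitude is below delta become zero.
    delta = threshold_ratio * abs_w.mean()
    mask = abs_w > delta
    # The sign gives -1 or +1 for the weights that survive the threshold.
    codes = (np.sign(weights) * mask).astype(np.int8)
    # The scale is the mean magnitude of the surviving weights.
    alpha = float(abs_w[mask].mean()) if mask.any() else 0.0
    return codes, alpha

w = np.array([[0.9, -0.05, 0.4],
              [-0.8, 0.02, -0.3]])
codes, alpha = ternarize(w)
w_hat = alpha * codes  # dequantized approximation of w
```

Each weight is stored as one of three values plus a shared scale, which is where the memory saving comes from; the reconstruction error between `w` and `w_hat` is the kind of loss that distillation during QAT is meant to compensate for.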