Generative Language Models (GLMs) have shown impressive performance in tasks such as text generation, understanding, and reasoning. However, their large model size poses challenges for practical deployment. Quantization-Aware Training (QAT) has become an increasingly popular way to address this problem, but current QAT methods for generative models incur a noticeable loss of accuracy. To counteract this issue, we propose a novel knowledge distillation method specifically designed for GLMs. Our method, called token-scaled logit distillation (TSLD), prevents overfitting and provides superior learning from both the teacher model and the ground truth. This research marks the first evaluation of ternary weight quantization-aware training of large-scale GLMs, with less than 1.0 degradation in perplexity, and achieves enhanced accuracy in tasks such as common-sense QA, arithmetic reasoning, and natural language understanding. Our code is available at https://github.com/aiha-lab/TSLD.