Recent studies have introduced effective compression techniques for Large Language Models (LLMs) based on post-training quantization or low-bit weight representation. Although quantized weights improve storage efficiency and enable faster inference, prior work indicates that quantization can degrade performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering language model type and scale as factors contributing to quantization loss. First, we show that 4-bit quantization with GPTQ decreases confidence in the true labels, with the size of the effect varying across language models. Second, we observe that the impact on confidence fluctuates across model scales. Finally, we propose an explanation for quantization loss based on confidence levels: quantization disproportionately affects samples on which the full model already exhibited low confidence.
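To make the quantities in the abstract concrete, the following is a minimal sketch of how confidence in the true label and a calibration score could be compared between a full-precision model and its quantized counterpart. It is not the authors' code: the function names, the choice of expected calibration error (ECE) with equal-width bins as the calibration measure, and the toy probabilities are all assumptions made for illustration.

```python
import numpy as np


def true_label_confidence(probs, labels):
    """Probability each model output assigns to the correct label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return probs[np.arange(len(labels)), labels]


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins over the top-class confidence (assumed metric)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a right-closed bin (lo, hi].
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Weight the accuracy/confidence gap by the bin's share of samples.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece


# Toy, made-up class probabilities for three samples (not results from the study).
labels = np.array([0, 0, 1])
full_probs = np.array([[0.90, 0.10], [0.60, 0.40], [0.20, 0.80]])
quant_probs = np.array([[0.85, 0.15], [0.45, 0.55], [0.25, 0.75]])

full_conf = true_label_confidence(full_probs, labels)
quant_conf = true_label_confidence(quant_probs, labels)
drop = full_conf - quant_conf  # per-sample loss of true-label confidence

print("mean confidence drop:", drop.mean())
print("ECE full / quantized:",
      expected_calibration_error(full_probs, labels),
      expected_calibration_error(quant_probs, labels))
print("sample with largest drop:", int(drop.argmax()),
      "| sample with lowest full-model confidence:", int(full_conf.argmin()))
```

In this toy data the largest drop falls on the sample where the full model was least confident, which mirrors the pattern the abstract describes. With real models, the probability arrays would come from the softmax outputs of the full-precision and GPTQ-quantized checkpoints evaluated on the same inputs.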