Quantization offers a practical solution for deploying LLMs in resource-constrained environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full-precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that the effect of quantization may vary by model and task; however, we did not observe any drastic changes that would discourage the use of quantization as a reliable model compression technique.
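To make the dead-neuron notion concrete, the following is a minimal sketch (not the authors' actual pipeline) of counting neurons whose activations stay near zero across a small dataset, using forward hooks on a Llama-style MLP activation. The model name, threshold, placeholder texts, and hook placement are all assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model, threshold, and probe texts; adjust for the actual setup.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
THRESHOLD = 1e-3

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

# Track the maximum absolute activation observed per neuron in each MLP layer.
max_abs_act = {}

def make_hook(layer_name):
    def hook(module, inputs, output):
        # output: (batch, seq_len, intermediate_size); reduce over batch and sequence.
        act = output.detach().abs().amax(dim=(0, 1))
        if layer_name in max_abs_act:
            max_abs_act[layer_name] = torch.maximum(max_abs_act[layer_name], act)
        else:
            max_abs_act[layer_name] = act
    return hook

hooks = [
    layer.mlp.act_fn.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(model.model.layers)
]

texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder dataset
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        model(**inputs)

for h in hooks:
    h.remove()

# A neuron is "dead" if its activation never exceeds the threshold on this data.
dead = sum((v < THRESHOLD).sum().item() for v in max_abs_act.values())
total = sum(v.numel() for v in max_abs_act.values())
print(f"Dead neurons: {dead}/{total}")
```

Running the same counting over a full-precision and a quantized checkpoint of the same model allows a direct comparison of dead-neuron counts under the two settings.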