Despite their superior performance, Large Language Models~(LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs and to increase the inference rate. However, a major challenge is that low-bit quantization methods often lead to performance degradation, so it is important to understand how quantization impacts the capacity of LLMs. Unlike previous studies that focus on overall performance, this work investigates the impact of quantization on \emph{emergent abilities}, which are important characteristics that distinguish LLMs from small language models. Specifically, we examine the abilities of in-context learning, chain-of-thought reasoning, and instruction following in quantized LLMs. Our experiments show that these emergent abilities still exist in 4-bit quantized models, while 2-bit models suffer severe performance degradation on tests of these abilities. To improve the performance of low-bit models, we conduct two further experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning. Our work derives a series of important findings for understanding the impact of quantization on emergent abilities and sheds light on the possibilities of extremely low-bit quantization for LLMs.