We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits, achieving floating-point quantization levels on average. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer by how different its output embeddings are from its input embeddings (the higher the better); the second estimates the importance of a layer by the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved to lower quantization using our proposed ordering, but only until 5-10% if moved without any specific ordering; (b) quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) layer-wise quantization to lower bits works better for larger LLMs with more layers than for smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
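The two importance strategies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`embedding_shift_score`, `outlier_weight_score`, `quantization_order`) and the z-score threshold of 3.0 are assumptions made for the example; the abstract specifies only that the first score compares a layer's output embeddings against its input embeddings (higher shift is more important) and the second counts weights far above average (fewer outliers is more important).

```python
import numpy as np

def embedding_shift_score(x_in, x_out):
    """First strategy: how much a layer changes its input embeddings.

    Returns 1 minus the mean cosine similarity between input and output
    embeddings; a larger shift is taken to mean a more important layer.
    """
    num = np.sum(x_in * x_out, axis=-1)
    den = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    return float(np.mean(1.0 - num / den))

def outlier_weight_score(weights, z_thresh=3.0):
    """Second strategy: fraction of weights much larger than average.

    A layer with fewer large-magnitude outlier weights is considered
    more important. The 3.0 z-score cutoff is an assumed parameter.
    """
    w = np.ravel(weights)
    z = (w - w.mean()) / w.std()
    return float(np.mean(np.abs(z) > z_thresh))

def quantization_order(scores, higher_is_important):
    """Rank layers least-important first: these are the layers that get
    moved to lower-bit quantization first."""
    return sorted(scores, key=scores.get, reverse=not higher_is_important)
```

Given per-layer scores from either strategy, `quantization_order` yields the sequence in which layers would be demoted to lower bit precision, matching the paper's finding that a good ordering tolerates far more low-bit layers than a random one.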