We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits, achieving floating-point quantization levels for the model as a whole. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from its input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved to lower quantization using our proposed ordering, but only until 5-10% if moved using no specific ordering; (b) quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) layer-wise quantization to lower bits works better for larger LLMs with more layers than for smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
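The two importance strategies above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-layer hidden states and weight matrices are available as NumPy arrays, uses cosine distance as one concrete way to measure how much a layer changes its embeddings for strategy (a), and an assumed z-score cutoff to count unusually large weights for strategy (b).

```python
import numpy as np

def importance_by_embedding_change(layer_in: np.ndarray, layer_out: np.ndarray) -> float:
    """Strategy (a): score a layer by how different its output embeddings are
    from its input embeddings (higher = more important). Here we use the mean
    cosine distance across token positions; this metric choice is an assumption."""
    num = (layer_in * layer_out).sum(axis=-1)
    den = np.linalg.norm(layer_in, axis=-1) * np.linalg.norm(layer_out, axis=-1)
    return float(np.mean(1.0 - num / den))

def importance_by_outlier_count(weights: np.ndarray, z: float = 3.0) -> float:
    """Strategy (b): score a layer by the fraction of weights whose magnitude is
    much larger than average (smaller fraction = more important). The z = 3
    threshold is an assumed, illustrative cutoff."""
    mag = np.abs(weights)
    return float(np.mean(mag > mag.mean() + z * mag.std()))
```

Given such scores for every layer, the layers would be ranked and the lowest-scoring ones (by strategy (a)) or highest-scoring ones (by strategy (b)) moved to lower bit precision first.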