The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques that enable such models to run on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our work builds on Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state of the art in LLM compression, outperforming all recently proposed techniques in terms of accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, our algorithm quantizes the 7B model to 6.93 perplexity (a 1.29 improvement relative to the best prior work, and within 1.81 points of FP16), the 13B model to 5.70 perplexity (a 0.36 improvement), and the 70B model to 3.94 perplexity (a 0.22 improvement) on WikiText2. We release our implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization.
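To make the idea of multi-codebook (additive) quantization concrete, the following is a minimal sketch of its general form, not the AQLM implementation: each group of `g` weights is stored as `M` integer codes, one per codebook, and reconstructed as the sum of the selected codewords. The function names, the greedy encoder, and the sizes in the usage lines are assumptions made for illustration. In particular, the greedy residual search is a simple stand-in and should not be read as the search procedure used in the paper. The bit count covers the codes only and ignores the storage cost of the codebooks themselves.

```python
import numpy as np


def bits_per_parameter(num_codebooks: int, code_bits: int, group_size: int) -> float:
    """Storage cost of the codes alone; codebook overhead is ignored."""
    return num_codebooks * code_bits / group_size


def decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Reconstruct each group as the sum of one codeword per codebook.

    codes:     (n_groups, M) integer indices into each codebook
    codebooks: (M, K, g) array holding K codewords of length g per codebook
    """
    num_codebooks = codebooks.shape[0]
    return sum(codebooks[m][codes[:, m]] for m in range(num_codebooks))


def greedy_encode(groups: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Illustrative greedy encoder (an assumption, not the paper's method).

    For each codebook in turn, pick the codeword closest to the current
    residual, then subtract it from the residual.
    """
    residual = groups.astype(np.float64).copy()
    codes = np.empty((groups.shape[0], codebooks.shape[0]), dtype=np.int64)
    for m, codebook in enumerate(codebooks):
        # Squared distance from every residual to every codeword: (n_groups, K)
        dist = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        codes[:, m] = dist.argmin(axis=1)
        residual -= codebook[codes[:, m]]
    return codes


# Hypothetical configuration: 2 codebooks of 2**8 codewords, groups of 8 weights.
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(2, 256, 8))
weights = rng.normal(size=(32, 8))          # 32 groups of 8 weights each
codes = greedy_encode(weights, codebooks)
approx = decode(codes, codebooks)
print(bits_per_parameter(num_codebooks=2, code_bits=8, group_size=8))  # 2.0
```

The arithmetic in `bits_per_parameter` shows why the budget is controlled by three knobs: adding codebooks or widening the codes raises the bit count, while larger groups lower it.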