The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques that can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state of the art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in an input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto-optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and it significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations in speed, while executing in a much smaller memory footprint.
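To make the multi-codebook idea behind the abstract concrete, the following is a minimal, self-contained sketch of additive (multi-codebook) quantization: a group of weights is represented by one index per codebook, and decoding sums the selected codewords. This is an illustrative toy, not the AQLM algorithm itself; the sizes `M`, `K`, `d`, the random codebooks, and the greedy residual encoder are all assumptions made for demonstration.

```python
import math
import random

# Illustrative sizes (assumptions, not AQLM's actual configuration):
# M codebooks, K codewords each, weight groups of dimension d.
M, K, d = 2, 16, 4

random.seed(0)
# Random codebooks stand in for learned ones in this toy example.
codebooks = [[[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
             for _ in range(M)]

def encode(w):
    """Greedy residual encoding: each codebook contributes the codeword
    closest to the remaining residual (a simple stand-in for the learned
    encoding used in real MCQ methods)."""
    residual = list(w)
    codes = []
    for m in range(M):
        best = min(range(K), key=lambda k: math.dist(codebooks[m][k], residual))
        codes.append(best)
        residual = [r - c for r, c in zip(residual, codebooks[m][best])]
    return codes

def decode(codes):
    """Reconstruct the weight group as the sum of the selected codewords."""
    out = [0.0] * d
    for m, k in enumerate(codes):
        out = [o + c for o, c in zip(out, codebooks[m][k])]
    return out

w = [random.gauss(0.0, 1.0) for _ in range(d)]
codes = encode(w)       # M small integers: the compressed representation
w_hat = decode(codes)   # approximate reconstruction of w
```

With M codebooks of K entries, each group of d weights costs M·log2(K) bits (here 2·4 bits for 4 weights, i.e. 2 bits per weight), which is how extreme compression rates arise from the codebook sizes alone.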