There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work has focused on improved one-shot quantization techniques and weight representations; yet purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.
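For readers unfamiliar with straight-through estimation, the following is a minimal sketch of a generic STE fine-tuning step on a toy linear model. It is not PV-Tuning and not the procedure used by QuIP# or AQLM; the function names (`quantize`, `ste_step`), the three-value grid, and the toy data are all assumptions made for illustration.

```python
def quantize(weights, grid):
    """Round each weight to the nearest value in a fixed grid."""
    return [min(grid, key=lambda g: abs(g - w)) for w in weights]


def ste_step(latent, grid, x, y, lr):
    """One straight-through-estimator step for y_hat = sum(q_i * x_i).

    The forward pass uses the quantized weights q = quantize(latent).
    The backward pass pretends quantize() is the identity, so the
    gradient with respect to q is applied directly to the latent weights.
    """
    q = quantize(latent, grid)
    y_hat = sum(qi * xi for qi, xi in zip(q, x))
    err = y_hat - y
    # Gradient of 0.5 * err**2 with respect to the quantized weights.
    grad_q = [err * xi for xi in x]
    # STE: copy grad_q onto the continuous latent weights unchanged.
    new_latent = [wi - lr * gi for wi, gi in zip(latent, grad_q)]
    return new_latent, 0.5 * err ** 2


# Toy setup (hypothetical): ternary grid, one sample whose exact
# solution on the grid is the weight vector [1.0, 1.0].
grid = [-1.0, 0.0, 1.0]
latent = [0.4, -0.2]
x, y = [1.0, 2.0], 3.0

losses = []
for _ in range(3):
    latent, loss = ste_step(latent, grid, x, y, lr=0.1)
    losses.append(loss)

print(quantize(latent, grid), losses)
```

The quantized weights change only when a latent weight crosses a rounding boundary, so the gradient used for the update does not reflect the true effect of the step on the quantized model. This mismatch is one informal way to see why the behavior of STE is hard to analyze in the low-bit regime.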