LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to degrade severely and even fail to benefit from LoRA finetuning. This paper proposes IR-QLoRA, a novel method that pushes quantized LLMs with LoRA toward high accuracy through information retention. IR-QLoRA relies mainly on two technologies derived from a unified information perspective: (1) statistics-based Information Calibration Quantization, which allows the quantized parameters of the LLM to retain the original information accurately; (2) finetuning-based Information Elastic Connection, which enables LoRA to utilize elastic representation transformation with diverse information. Comprehensive experiments show that IR-QLoRA significantly improves accuracy across the LLaMA and LLaMA2 families under 2-4 bit-widths; e.g., 4-bit LLaMA-7B achieves a 1.4% improvement on MMLU compared with state-of-the-art methods. This significant performance gain requires only a tiny 0.31% additional time consumption, revealing the satisfactory efficiency of IR-QLoRA. We highlight that IR-QLoRA enjoys excellent versatility: it is compatible with various frameworks (e.g., NormalFloat and Integer quantization) and brings general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.
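To make the "statistics-based Information Calibration Quantization" idea concrete, the following is a minimal, hypothetical sketch of entropy-based calibration before quantization: a calibration constant `tau` is searched so that the quantized representation retains as much information (measured by entropy of the quantization indices) as possible. The function names, the grid search over `tau`, and the per-block absmax scaling are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def quantize_to_levels(w, levels):
    # Map each normalized weight to the index of the nearest codebook level
    # (a stand-in for a NormalFloat or Integer quantization grid).
    return np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)

def index_entropy(idx, n_levels):
    # Shannon entropy (bits) of the quantization-index distribution:
    # higher entropy means the levels are used more evenly, i.e. more
    # information about the original weights is retained.
    counts = np.bincount(idx, minlength=n_levels).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def calibrated_quantize(w, levels, tau_candidates):
    # Illustrative calibration: try each candidate constant tau, shift the
    # weights by tau, scale by per-block absmax, and keep the tau whose
    # quantized indices have maximum entropy.
    best_tau, best_h = 0.0, -1.0
    for tau in tau_candidates:
        shifted = w - tau
        s = np.max(np.abs(shifted))
        h = index_entropy(quantize_to_levels(shifted / s, levels), len(levels))
        if h > best_h:
            best_h, best_tau = h, tau
    shifted = w - best_tau
    s = np.max(np.abs(shifted))
    idx = quantize_to_levels(shifted / s, levels)
    # Dequantize: undo the scale and the calibration shift.
    return levels[idx] * s + best_tau, best_tau
```

Under this sketch, a 4-bit grid would use 16 levels; the search over `tau` adds only a small one-off cost per weight block, consistent with the paper's report of negligible extra time.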