QLoRA: Efficient Finetuning of Quantized LLMs

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization, which reduces the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks cannot be trusted to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
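To make the ideas behind (a) and (b) concrete, the following is a minimal sketch in plain Python, not the released implementation. It makes several assumptions that are not stated in this abstract: the 16-level codebook is built from evenly spaced quantiles of a standard normal and is only a simplified stand-in for the actual NF4 values; weights are quantized per block with an absmax scale; and the overhead arithmetic assumes a block size of 64, 32-bit first-level constants, and 8-bit second-level constants grouped in blocks of 256. All function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Simplified stand-in for a NormalFloat codebook (assumption, not exact NF4).

    Takes the midpoints of 2**bits equal-probability bins of N(0, 1) and
    rescales them so the codebook spans [-1, 1].
    """
    levels = 2 ** bits
    nd = NormalDist()
    # Bin midpoints avoid the infinite quantiles at probabilities 0 and 1.
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(block, codebook):
    """Map each weight in a block to the index of its nearest codebook entry."""
    # The absmax scale is the per-block quantization constant.
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Recover approximate weights from codebook indices and the block scale."""
    return [codebook[j] * absmax for j in indices]


def constant_overhead_bits(block_size=64, double_quant=False, second_block=256):
    """Average bits per parameter spent on quantization constants.

    Without double quantization: one 32-bit constant per block.
    With double quantization (assumed layout): one 8-bit constant per block,
    plus one 32-bit constant per group of `second_block` first-level constants.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block)
```

Under these assumptions, `constant_overhead_bits()` gives the per-parameter cost of storing one full-precision constant per block, and `constant_overhead_bits(double_quant=True)` gives the reduced cost once the constants are themselves quantized. A round trip through `quantize_block` and `dequantize_block` shows the reconstruction error that a 4-bit quantile codebook introduces on a block of weights.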
