We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 3-bit or 4-bit precision on as little as one 48GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 3-bit LLMs for the first time: finetuning on top of state-of-the-art 3-bit OPTQ quantization often outperforms finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction-following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models, including the first family of 3-bit instruction-following Alpaca LLMs, as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
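As a rough sanity check on the single-GPU claim, the following back-of-the-envelope sketch estimates the memory taken by the packed weights of a 65B-parameter model at several precisions. It is an illustration under stated assumptions, not a measurement: it counts weights only (no quantizer metadata such as scales or zero points, no LoRA parameters, optimizer state, or activations), and it treats 1 GB as 10^9 bytes. The helper name `weight_memory_gb` is hypothetical.

```python
# Weight-only memory estimate. Assumptions (not from the paper):
# - only the packed weights are counted; quantizer metadata, LoRA
#   parameters, optimizer state, and activations are ignored
# - 1 GB is taken to be 10**9 bytes
GB = 10**9


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Memory needed for the packed weights alone, in decimal GB."""
    return num_params * bits_per_weight / 8 / GB


NUM_PARAMS = 65 * 10**9  # 65B parameters, as stated in the abstract
GPU_MEMORY_GB = 48       # single-GPU budget, as stated in the abstract

# Compare 16-bit and 8-bit baselines with the 4-bit and 3-bit settings.
estimates = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (16, 8, 4, 3)}

for bits, gb in estimates.items():
    verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:7.3f} GB ({verdict} in {GPU_MEMORY_GB} GB)")
```

Under these assumptions, only the 4-bit (32.5 GB) and 3-bit (24.375 GB) weights fall below the 48GB budget, and the remaining headroom must still cover everything the estimate leaves out.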