Large Language Models (LLMs) have had a remarkable impact across a wide spectrum of natural language processing tasks. Fine-tuning these pre-trained models on downstream datasets yields further significant performance gains, but the process is challenging because of its extraordinary resource requirements. Existing efforts therefore focus on parameter-efficient fine-tuning, which unfortunately fails to capitalize on the full potential of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance. Our framework incorporates two novel ideas: (i) we adopt the efficient Lion optimizer, which keeps track of only the momentum and has consistent update magnitudes for each parameter, an inherent advantage for robust quantization; and (ii) we quantize all model states and store them as integer values, and present a gradient flow and parameter update scheme for the quantized weights. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance; for example, tuning a LLaMA-7B model requires less than 30GB of memory, which fits on a single A6000 GPU.
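To illustrate why Lion's update rule is convenient for quantization, the following is a minimal sketch in plain Python, not the QFT implementation. `lion_step` follows the published Lion update (sign of an interpolated momentum), so every parameter moves by the same magnitude `lr` regardless of gradient scale. The helpers `quantize_int8` and `dequantize_int8` are hypothetical names for a simple symmetric absmax quantizer, used here only as a stand-in for storing a model state as integers; the quantizer actually used by QFT is not specified in this abstract.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One Lion step over flat lists of floats.

    Only a single optimizer state (the momentum) is tracked, and the update
    direction is a sign, so its magnitude is identical for every parameter.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (direction + weight_decay * p))
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


def quantize_int8(values):
    """Symmetric absmax quantization to integers in [-127, 127].

    Assumption: a generic scheme for illustration, not QFT's quantizer.
    """
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    params = [1.0, -2.0, 0.5]
    grads = [0.3, -1000.0, 0.0]   # wildly different gradient scales
    momentum = [0.0, 0.0, 0.0]

    params, momentum = lion_step(params, grads, momentum, lr=0.1)
    print("updated params:", params)

    # Store the momentum as integers plus one floating-point scale.
    codes, scale = quantize_int8(momentum)
    print("int8 codes:", codes, "scale:", scale)
    print("restored momentum:", dequantize_int8(codes, scale))
```

Because the step size does not depend on the gradient magnitude, the parameters with gradients 0.3 and -1000 both move by exactly `lr`. The momentum, by contrast, still spans a wide range, which is why a single per-tensor scale (as in this sketch) loses small entries and a real system would likely need a finer-grained quantizer.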