Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the performance of task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a mixed-precision delta quantization approach. This method employs higher-bit representations for the singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even vision-language models (VLMs). Experimental results demonstrate that our approach performs comparably to fully fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
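The core idea above can be sketched in a few lines: decompose the delta weights via SVD, then quantize the singular vectors at a bit-width that depends on the magnitude of their singular values. The following NumPy sketch is illustrative only; the function names, the simple symmetric uniform quantizer, and the specific bit-widths (8-bit for the top components, 2-bit for the tail) are assumptions, not the paper's exact scheme.

```python
import numpy as np

def quantize_uniform(x, bits):
    # Illustrative symmetric uniform quantizer (assumed, not the paper's scheme):
    # map values onto 2^(bits-1) - 1 signed levels and dequantize back.
    levels = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / levels if max_abs > 0 else 1.0
    return np.round(x / scale) * scale

def mixed_precision_delta(delta, k_high=8, bits_high=8, bits_low=2):
    # SVD of the delta weights; their singular values follow a long-tail
    # distribution, so the first few components carry most of the energy.
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    approx = np.zeros_like(delta)
    for i in range(len(S)):
        # Higher-bit representation for singular vectors with large singular
        # values; low-bit representation for the long tail.
        bits = bits_high if i < k_high else bits_low
        u_q = quantize_uniform(U[:, i], bits)
        v_q = quantize_uniform(Vt[i, :], bits)
        approx += S[i] * np.outer(u_q, v_q)
    return approx
```

On a matrix with a long-tail spectrum, spending the extra bits on the top singular vectors yields a lower reconstruction error than quantizing every component at the low bit-width, which is the intuition behind the mixed-precision allocation.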