The low-rank adaptation (LoRA) method substantially reduces the number of trainable parameters when fine-tuning large language models (LLMs). However, it still requires expensive activation memory to update the low-rank weights. Reducing the number of LoRA layers or using activation recomputation can harm fine-tuning performance or increase computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces activation memory without performance degradation or expensive recomputation. LoRA-FA freezes the projection-down weight $A$ and updates only the projection-up weight $B$ in each LoRA layer. This ensures that the change in model weights resides in a low-rank space during LLM fine-tuning, while eliminating the need to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA consistently achieves fine-tuning accuracy close to that of full-parameter fine-tuning and LoRA across different tasks. Furthermore, LoRA-FA reduces the overall memory cost by up to 1.4$\times$ compared to LoRA.
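To illustrate why freezing $A$ removes the need to store full-rank input activations, the following is a minimal sketch, not the authors' implementation. It assumes a single adapted linear layer of the form $y = Wx + B(Ax)$ with $W$ frozen, and counts only the activations that the adapter's weight gradients depend on. The function name `lora_activation_elements` and the shapes used are hypothetical, chosen purely for illustration.

```python
def lora_activation_elements(batch, seq_len, d_in, rank, freeze_a):
    """Count activation elements one adapted layer keeps for the backward pass.

    Assumed layer: y = W x + B (A x), with W frozen,
    A of shape (rank, d_in) and B of shape (d_out, rank).

    grad_B = grad_y (A x)^T    -> needs A x  (rank values per token)
    grad_A = B^T grad_y x^T    -> needs x    (d_in values per token)
    """
    tokens = batch * seq_len
    # The low-rank activation A x is needed whenever B is trained.
    low_rank = tokens * rank
    # The full-rank input x is needed only if A is also trained.
    full_rank = 0 if freeze_a else tokens * d_in
    return low_rank + full_rank


# Hypothetical shapes: batch 8, sequence length 512, hidden size 4096, rank 16.
lora = lora_activation_elements(8, 512, 4096, 16, freeze_a=False)
lora_fa = lora_activation_elements(8, 512, 4096, 16, freeze_a=True)
print(f"LoRA:    {lora} stored elements per adapted layer")
print(f"LoRA-FA: {lora_fa} stored elements per adapted layer")
```

This count covers only the adapter inputs of one layer. It ignores every other activation the model stores (attention, nonlinearities, normalization) as well as weights and optimizer state, so the per-layer ratio it prints is much larger than the overall saving of up to 1.4$\times$ reported above.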