Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10--100$\times$ greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
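The low-rank perturbation the abstract refers to can be sketched as follows. This is a minimal illustrative example, not the paper's code: a frozen weight matrix $W$ is adapted by a trainable product $BA$ of rank $r$, and the dimensions, the scaling factor `alpha`, and the function name `lora_forward` are all hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_out, d_in, r = 8, 16, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable rank-r factor
B = np.zeros((d_out, r))                    # trainable, initialized to zero
alpha = 16                                  # LoRA scaling hyperparameter

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # so the trainable parameter count is r * (d_in + d_out) instead of
    # d_in * d_out for a full update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted model initially matches the base model.
assert np.allclose(lora_forward(x), W @ x)
```

With $B$ initialized to zero the perturbation starts as the zero matrix, so finetuning begins exactly at the base model; the rank of the learned update is capped at $r$, which is the constraint the abstract contrasts with the 10--100$\times$ higher-rank perturbations found under full finetuning.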