In this paper, we show that Low-Rank Adaptation (LoRA), as originally introduced in Hu et al. (2021), leads to suboptimal finetuning of models with large width (embedding dimension). The cause is that the adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large-width networks, we demonstrate that a shared learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality can be corrected simply by setting different learning rates for A and B with a well-chosen ratio. We call the resulting algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (by 1-2$\%$) and finetuning speed (up to $\sim 2\times$ speedup) at the same computational cost as LoRA.
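
The core idea, separate learning rates for the two adapter matrices, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the helper name `lora_plus_param_groups`, the `lora_A`/`lora_B` naming convention, and the ratio used in the demo are all hypothetical. The abstract says only that the ratio should be well chosen, so the value below is an arbitrary placeholder. The sketch also assumes that B is the matrix given the larger rate.

```python
from typing import Dict, Iterable, List, Tuple


def lora_plus_param_groups(
    named_params: Iterable[Tuple[str, object]],
    lr_a: float,
    lr_ratio: float,
) -> List[Dict[str, object]]:
    """Split LoRA adapter parameters into two groups with different learning rates.

    Assumptions (not taken from the abstract): adapter parameter names
    contain "lora_A" or "lora_B", and B is trained with lr_a * lr_ratio.
    Setting lr_ratio = 1 gives both matrices the same rate, as in plain LoRA.
    """
    if lr_a <= 0 or lr_ratio <= 0:
        raise ValueError("lr_a and lr_ratio must be positive")

    group_a, group_b = [], []
    for name, param in named_params:
        if "lora_A" in name:
            group_a.append(param)
        elif "lora_B" in name:
            group_b.append(param)
        # Any other parameter is treated as frozen and left out.

    return [
        {"params": group_a, "lr": lr_a},
        {"params": group_b, "lr": lr_a * lr_ratio},
    ]


# Stand-in (name, parameter) pairs; strings replace real tensors here.
demo_params = [
    ("layer0.attn.q.weight", "W_q"),
    ("layer0.attn.q.lora_A", "A_q"),
    ("layer0.attn.q.lora_B", "B_q"),
]

# lr_ratio=8.0 is a placeholder, not a recommended value.
groups = lora_plus_param_groups(demo_params, lr_a=1e-4, lr_ratio=8.0)
for group in groups:
    print(group)
```

The returned list of dictionaries follows the per-parameter-group layout that PyTorch optimizers accept, so with a real model one could pass `model.named_parameters()` to the helper and hand the result to an optimizer such as `torch.optim.AdamW`. The extra cost over standard LoRA is nil, since only the step sizes differ.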