In this paper, we show that Low-Rank Adaptation (LoRA), as originally introduced in Hu et al. (2021), leads to suboptimal finetuning of models with large width (embedding dimension). This is because the adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large-width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality can be corrected simply by setting different learning rates for the adapter matrices A and B with a well-chosen ratio. We call the resulting algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2$\%$ improvements) and finetuning speed (up to $\sim 2\times$ speedup) at the same computational cost as LoRA.
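To make the core idea concrete, the following is a minimal sketch of assigning separate learning rates to the two adapter matrices. It is an illustration under stated assumptions, not the authors' implementation: the function name `lora_plus_param_groups`, the `lora_A` / `lora_B` naming convention, and the values `lr_a=1e-4` and `ratio=16` are all hypothetical. The abstract does not state the ratio or which matrix receives the larger rate; the sketch assumes B's rate is a multiple of A's.

```python
def lora_plus_param_groups(named_params, lr_a, ratio):
    """Split (name, param) pairs into two groups with different learning rates.

    Assumption: adapter parameters are identified by the substrings
    "lora_A" and "lora_B" in their names (a hypothetical convention).
    Parameters matching neither substring (e.g. frozen pretrained
    weights) are left out of both groups.
    """
    group_a, group_b = [], []
    for name, param in named_params:
        if "lora_A" in name:
            group_a.append(param)
        elif "lora_B" in name:
            group_b.append(param)
    return [
        {"params": group_a, "lr": lr_a},          # learning rate for A
        {"params": group_b, "lr": lr_a * ratio},  # assumed: B's rate = ratio * A's rate
    ]


# Toy usage with placeholder strings standing in for parameter tensors.
named = [
    ("layer0.lora_A", "A0"),
    ("layer0.lora_B", "B0"),
    ("layer0.weight", "W0"),  # frozen pretrained weight, not trained
]
groups = lora_plus_param_groups(named, lr_a=1e-4, ratio=16)
```

The returned list of dictionaries follows the per-parameter-group format that PyTorch optimizers accept, so with real tensors it could be passed to an optimizer constructor such as `torch.optim.AdamW(groups)`. Setting `ratio=1` recovers the single shared learning rate of standard LoRA.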