Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices' gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices' subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA's convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.
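The anisotropic scaling described above can be illustrated with a minimal NumPy sketch. It assumes the standard LoRA update W = W0 + BA, for which the chain rule gives dL/dB = G Aᵀ and dL/dA = Bᵀ G, where G = dL/dW is the full fine-tuning gradient. The dimensions, the random matrices, and the singular values (10 and 0.1) are illustrative assumptions, and the "scale-free" gradient G Q is only a reference for comparison, not the SDS-LoRA parameterization itself, whose details are not given in this abstract.

```python
import numpy as np

def lora_grads(G, B, A):
    """Gradients w.r.t. the LoRA factors for W = W0 + B @ A,
    given the full fine-tuning gradient G = dL/dW (chain rule)."""
    grad_B = G @ A.T   # shape (m, r)
    grad_A = B.T @ G   # shape (r, n)
    return grad_B, grad_A

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2  # illustrative sizes (assumption)

# Build A with known singular values: A = diag(sigma) @ Q.T,
# where Q has orthonormal columns spanning A's row space.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))  # (n, r)
sigma = np.array([10.0, 0.1])                      # ill-conditioned on purpose
A = np.diag(sigma) @ Q.T                           # (r, n)

B = rng.standard_normal((m, r))
G = rng.standard_normal((m, n))  # stand-in for the full fine-tuning gradient

grad_B, grad_A = lora_grads(G, B, A)

# Reference: G seen only through the orthonormal basis Q (no singular values).
basis_grad = G @ Q                                 # (m, r)

# Each column of grad_B is the corresponding column of basis_grad
# multiplied by that direction's singular value.
ratio = np.linalg.norm(grad_B, axis=0) / np.linalg.norm(basis_grad, axis=0)
print("per-direction scaling:", ratio)  # equals sigma up to floating-point error
```

In this sketch the gradient reaching B is the basis-only gradient with each direction rescaled by its singular value, so the direction with singular value 10 dominates the one with 0.1 by a factor of 100, regardless of how much of G actually lies along each direction.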