Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. In this work, we address the gap between full training steps followed by low-rank projection (SVDLoRA) and LoRA fine-tuning. We propose LoRSum, a memory-efficient subroutine that closes this gap for gradient descent by casting LoRA optimization as a proximal sub-problem and solving it efficiently with alternating least squares updates, which we prove to be an implicit block power method. We recover several recently proposed preconditioning methods for LoRA as special cases, and show that LoRSum can also be used to update a low-rank momentum. To extend the approach to full steps of preconditioned gradient descent, we propose a scaled variant of LoRSum that uses structured metrics such as K-FAC and Shampoo, and we show that storing only the diagonal of these metrics preserves their performance while remaining memory-efficient. Experiments on a synthetic task, CIFAR-100, and language-model fine-tuning on GLUE, SQuAD v2, and WikiText-103 show that our method matches or improves on LoRA baselines at modest compute overhead, while avoiding full-matrix SVD projections and retaining LoRA-style parameter efficiency.
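To illustrate the core mechanism, the following is a minimal numpy sketch of approximating a full gradient matrix with a rank-r factorization via alternating least squares, which behaves as an implicit block power iteration toward the dominant singular subspace. The toy matrix sizes, iteration count, and variable names are illustrative assumptions, not the paper's implementation, and the scaled/preconditioned variant is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4

# A toy "full gradient": near rank-r signal plus small noise.
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Alternating least squares on min ||G - B @ A||_F: fix one factor,
# solve for the other in closed form. Iterating drives B @ A toward
# the best rank-r approximation (an implicit block power method).
B = rng.standard_normal((m, r))
for _ in range(30):
    A = np.linalg.lstsq(B, G, rcond=None)[0]        # solve B @ A ≈ G for A
    B = np.linalg.lstsq(A.T, G.T, rcond=None)[0].T  # solve B @ A ≈ G for B

als_err = np.linalg.norm(G - B @ A)

# Reference: truncated SVD gives the optimal rank-r approximation.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
svd_err = np.linalg.norm(G - (U[:, :r] * s[:r]) @ Vt[:r])

print(f"ALS error {als_err:.4f} vs optimal SVD error {svd_err:.4f}")
```

Each least-squares solve only touches the small factors and the gradient, which is what makes this kind of subroutine memory-efficient compared to forming a full-matrix SVD projection at every step.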