In this work we study how to enhance the Low-Rank Adaptation (LoRA) fine-tuning procedure by introducing a Riemannian preconditioner in its optimization step. Specifically, we introduce an $r\times r$ preconditioner in each gradient step, where $r$ is the LoRA rank. This preconditioner requires only a small change to existing optimizer code and adds negligible storage and runtime overhead. Our experimental results with both large language models and text-to-image diffusion models show that our preconditioner significantly improves the convergence and reliability of SGD and AdamW. Moreover, the training process becomes much more robust to hyperparameter choices such as the learning rate. Theoretically, we show that fine-tuning a two-layer ReLU network in the convex parameterization with our preconditioner has a convergence rate independent of the condition number of the data matrix. This Riemannian preconditioner, previously explored in classic low-rank matrix recovery, is introduced to deep learning tasks for the first time in our work. We release our code at https://github.com/pilancilab/Riemannian_Preconditioned_LoRA.
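To make the idea of an $r\times r$ preconditioner concrete, the following is a minimal sketch of one preconditioned gradient step. It assumes the scaled gradient descent form familiar from low-rank matrix recovery: for an adapter $W = BA$ with $B \in \mathbb{R}^{m\times r}$ and $A \in \mathbb{R}^{r\times n}$, the gradient of $A$ is multiplied by $(B^\top B + \delta I)^{-1}$ and the gradient of $B$ by $(AA^\top + \delta I)^{-1}$. The abstract does not spell out this formula, so the exact form, the function name `preconditioned_step`, and the damping parameter `delta` are illustrative assumptions rather than the released implementation.

```python
import numpy as np


def preconditioned_step(A, B, grad_A, grad_B, lr=1e-2, delta=1e-6):
    """One SGD-style step with r x r preconditioners (hypothetical helper).

    Assumed shapes: A is (r, n), B is (m, r), so the adapter is W = B @ A.
    grad_A and grad_B are the ordinary gradients with respect to A and B.
    delta is an assumed small damping term that keeps the r x r systems
    well conditioned when a factor is near rank-deficient.
    """
    r = A.shape[0]
    eye = np.eye(r)

    # Both preconditioners are only r x r, so they are cheap to form and solve.
    P_A = B.T @ B + delta * eye  # scales the gradient of A
    P_B = A @ A.T + delta * eye  # scales the gradient of B

    # (B^T B + delta I)^{-1} grad_A, computed by a linear solve.
    scaled_grad_A = np.linalg.solve(P_A, grad_A)
    # grad_B (A A^T + delta I)^{-1}; P_B is symmetric, so solve on the transpose.
    scaled_grad_B = np.linalg.solve(P_B, grad_B.T).T

    return A - lr * scaled_grad_A, B - lr * scaled_grad_B
```

Under these assumptions, the only extra work relative to a plain gradient step is two $r\times r$ linear solves per adapter, which is consistent with the small overhead described above when $r$ is much smaller than $m$ and $n$.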