Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating only a subset of their parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Motivated by an investigation into the different roles that the LoRA matrices play during fine-tuning, this paper characterizes and leverages an unexpected asymmetry in the importance of the low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product $BA$, we observe that the $B$ and $A$ matrices have distinct functions: $A$ extracts features from the input, while $B$ uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random, untrained $A$ should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings from training only $B$ improve the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.
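To make the asymmetric setup concrete, the following is a minimal sketch of a single linear layer whose frozen weight $W_0$ is adapted by $BA$, with $A$ drawn at random and held fixed while only $B$ receives gradient updates. The function names (`init_adapter`, `forward`, `sgd_step_B`, `squared_error`), the Gaussian scale used for $A$, the zero initialization of $B$, and the squared-error loss are all assumptions made for illustration; they are not taken from the paper's experiments.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def init_adapter(d_out, d_in, r, seed=0):
    """Return (A, B): A is r x d_in, random and frozen; B is d_out x r, zero."""
    rng = random.Random(seed)
    # Assumed initialization: Gaussian entries scaled by 1/sqrt(d_in).
    A = [[rng.gauss(0.0, 1.0 / d_in ** 0.5) for _ in range(d_in)]
         for _ in range(r)]
    # B starts at zero, so the adapted layer initially equals the base layer.
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def forward(W0, A, B, x):
    """Compute (W0 + B A) x and return it with the features z = A x."""
    z = matvec(A, x)       # A extracts r features from the input
    base = matvec(W0, x)   # frozen pre-trained path
    delta = matvec(B, z)   # B maps the features to the output space
    return [b + d for b, d in zip(base, delta)], z


def squared_error(y_hat, y):
    """Loss 0.5 * ||y_hat - y||^2."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


def sgd_step_B(W0, A, B, x, y, lr=0.1):
    """One SGD step on the squared error; A and W0 are left untouched."""
    y_hat, z = forward(W0, A, B, x)
    err = [p - t for p, t in zip(y_hat, y)]
    # Gradient of the loss with respect to B[i][j] is err[i] * z[j].
    return [[B[i][j] - lr * err[i] * z[j] for j in range(len(z))]
            for i in range(len(B))]
```

In this sketch only the $d_{\text{out}} \times r$ entries of $B$ are trainable, compared with $r(d_{\text{in}} + d_{\text{out}})$ entries when both matrices are trained; for a square layer that halves the trainable parameter count at a given rank.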