Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across the model weights. Recent works focus either on different initialization schemes or on learning adaptive ranks during fine-tuning. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, which in turn leads to suboptimal performance. We propose to improve LoRA by initializing the new weights in a data-driven manner: we compute the singular value decomposition (SVD) of minibatches of activation vectors. We then initialize the LoRA matrices with the obtained right-singular vectors and redistribute ranks among all weight matrices so as to provably store the maximum amount of information about the downstream data in the newly introduced weights. This way, the fine-tuning process only needs to learn which of this information to maintain or discard. We call our new method $\textbf{E}$xplained $\textbf{V}$ariance $\textbf{A}$daptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA converges faster than competing methods and achieves the highest average score across a multitude of tasks per domain, while rank redistribution reduces the number of trainable parameters.
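The core idea of the abstract can be sketched in a few lines: run an SVD on a minibatch of activations entering a linear layer, use the top right-singular vectors to initialize the LoRA $A$ matrix (with $B = 0$, so the initial update is zero as in standard LoRA), and allocate a shared rank budget across layers according to explained variance. The sketch below is an illustration of this idea only, not the authors' implementation; the proportional rank-allocation rule and all function names are assumptions made for clarity.

```python
import numpy as np

def eva_init(activations, rank):
    """Data-driven LoRA initialization (illustrative sketch of EVA's idea).

    activations: (batch, d_in) minibatch of inputs to a linear layer.
    Returns an A matrix of shape (rank, d_in) built from the top
    right-singular vectors, plus the fraction of variance they explain.
    B is initialized to zero, so the initial update B @ A vanishes,
    as in standard LoRA.
    """
    # Rows of vt are the right-singular vectors of the activation minibatch.
    _, s, vt = np.linalg.svd(activations, full_matrices=False)
    A = vt[:rank]  # (rank, d_in) initialization for LoRA A
    # Squared singular values measure the variance captured per direction.
    explained = (s[:rank] ** 2).sum() / (s ** 2).sum()
    return A, explained

def redistribute_ranks(layer_explained, total_rank):
    """Allocate a shared rank budget across layers in proportion to
    explained variance (a hypothetical proportional scheme, for
    illustration; the paper's criterion may differ)."""
    weights = np.asarray(layer_explained, dtype=float)
    raw = weights / weights.sum() * total_rank
    # Round and keep at least rank 1 per layer; rounding may shift the
    # total slightly from the exact budget.
    return np.maximum(1, np.round(raw).astype(int))
```

For example, a layer whose minibatch activations are nearly low-rank would report a high explained-variance fraction and thus receive a larger share of the rank budget, while layers with diffuse activation spectra are assigned fewer ranks.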