Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random, with a uniform rank distribution across the model weights. Recent work focuses on either different initialization schemes or learning adaptive ranks during fine-tuning. The two approaches have only been investigated in isolation, resulting in either slow convergence or a uniform rank distribution, which in turn leads to suboptimal performance. We propose to improve LoRA by initializing the new weights in a data-driven manner, computing the singular value decomposition (SVD) on minibatches of activation vectors. We then initialize the LoRA matrices with the obtained right-singular vectors and redistribute ranks among all weight matrices to provably store the maximum amount of information of the downstream data in the newly introduced weights. In this way, fine-tuning only needs to learn which information to maintain and which to neglect. We call our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution.
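The initialization and rank redistribution described above can be illustrated with a minimal NumPy sketch. It is not the reference implementation: it assumes a single minibatch of activations per layer and a full (rather than incremental) SVD, it scores each component by its share of the layer's squared singular values without centering, and the names `eva_style_init`, `rank_budget`, and `max_rank` are hypothetical. It also assumes the usual LoRA convention in which the up-projection starts at zero, which the text above does not state.

```python
import numpy as np

def eva_style_init(activations, weight_shapes, rank_budget, max_rank):
    """Sketch of a data-driven LoRA initialization with rank redistribution.

    activations:   dict layer name -> array (n_vectors, d_in), one minibatch
    weight_shapes: dict layer name -> (d_out, d_in) of the adapted weight
    rank_budget:   total number of ranks to distribute over all layers (assumed)
    max_rank:      upper bound on the rank of any single layer (assumed)
    """
    candidates = []   # (variance share, layer name) for every component
    components = {}   # layer name -> leading right-singular vectors
    for name, x in activations.items():
        # Rows of vt are the right-singular vectors of the activation matrix.
        _, s, vt = np.linalg.svd(x, full_matrices=False)
        total = max(float(np.sum(s ** 2)), np.finfo(float).tiny)
        share = s ** 2 / total   # uncentered "explained variance" share
        k = min(max_rank, len(s))
        components[name] = vt[:k]
        candidates.extend((share[i], name) for i in range(k))

    # Rank redistribution: keep the globally best-scoring components.
    # Shares are sorted within a layer, so a layer's picks form a prefix.
    candidates.sort(key=lambda c: c[0], reverse=True)
    ranks = {name: 0 for name in activations}
    for _, name in candidates[:rank_budget]:
        ranks[name] += 1

    lora = {}
    for name, r in ranks.items():
        d_out, _ = weight_shapes[name]
        a = components[name][:r].copy()   # down-projection, shape (r, d_in)
        b = np.zeros((d_out, r))          # up-projection starts at zero
        lora[name] = (a, b)
    return ranks, lora
```

In this sketch, a layer whose activations concentrate in a few directions contributes a few high-scoring components, while a layer with evenly spread activations contributes many low-scoring ones, so the global cut-off determines how many ranks each layer receives. A layer may end up with rank 0, in which case it receives no trainable adapter weights.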