Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, the common practice of fine-tuning all parameters of a pre-trained model becomes prohibitive when there are many downstream tasks. Therefore, many fine-tuning methods have been proposed to learn incremental updates of pre-trained weights in a parameter-efficient way, e.g., as low-rank increments. These methods often distribute the budget of incremental updates evenly across all pre-trained weight matrices and overlook the varying importance of different weight parameters. As a consequence, fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance scores. In particular, AdaLoRA parameterizes the incremental updates in the form of a singular value decomposition. This parameterization allows us to effectively prune the singular values of unimportant updates, which reduces their parameter budget while avoiding expensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA achieves notable improvement over baselines, especially in low-budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA.
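To make the SVD-form parameterization and budget allocation concrete, the following is a minimal NumPy sketch, not the released AdaLoRA implementation. Each incremental update is stored as three factors P, lam, and Q so that the update equals P diag(lam) Q, and pruning amounts to zeroing entries of lam without ever computing an exact SVD. The function names (`init_increment`, `delta_w`, `allocate_budget`) are hypothetical, and the importance score used here, the absolute value of each singular value, is an assumption chosen for simplicity; the abstract does not specify how the score is computed.

```python
import numpy as np


def init_increment(d_out, d_in, rank, rng):
    """Return factors (P, lam, Q) of an update P @ diag(lam) @ Q."""
    P = rng.normal(scale=0.01, size=(d_out, rank))
    lam = np.zeros(rank)  # zero singular values: the initial update is zero
    Q = rng.normal(scale=0.01, size=(rank, d_in))
    return P, lam, Q


def delta_w(P, lam, Q):
    """Materialize the incremental update P @ diag(lam) @ Q."""
    # Broadcasting lam over the columns of P equals multiplying by diag(lam).
    return (P * lam) @ Q


def allocate_budget(lams, budget):
    """Keep the `budget` highest-scoring singular values over all matrices.

    Assumption: the importance score is |singular value|. Ties at the
    threshold are all kept, so slightly more than `budget` may survive.
    """
    if budget <= 0:
        return [np.zeros_like(lam) for lam in lams]
    scores = np.concatenate([np.abs(lam) for lam in lams])
    if budget >= scores.size:
        return [lam.copy() for lam in lams]
    threshold = np.sort(scores)[-budget]
    return [np.where(np.abs(lam) >= threshold, lam, 0.0) for lam in lams]


# Two weight matrices, each starting with three candidate singular values.
lams = [np.array([0.9, 0.1, 0.5]), np.array([0.05, 0.7, 0.02])]
pruned = allocate_budget(lams, budget=3)
ranks = [int(np.count_nonzero(lam)) for lam in pruned]
print(ranks)
```

With a global budget of three, the first matrix retains two singular values and the second retains one, so the effective rank differs per matrix instead of being fixed uniformly in advance.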