Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, the common practice of fine-tuning all parameters of a pre-trained model becomes prohibitive when a large number of downstream tasks are present. Many fine-tuning methods have therefore been proposed to learn incremental updates of pre-trained weights in a parameter-efficient way, e.g., low-rank increments. These methods often distribute the budget of incremental updates evenly across all pre-trained weight matrices and overlook the varying importance of different weight parameters. As a consequence, fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance scores. In particular, AdaLoRA parameterizes the incremental updates in the form of a singular value decomposition. This approach allows us to effectively prune the singular values of unimportant updates, which reduces their parameter budget while circumventing intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA achieves notable improvements over baselines, especially in low-budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA.
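As a rough illustration of the idea described above, the following NumPy sketch parameterizes each incremental update as a product P · diag(λ) · Q and prunes the lowest-scoring singular values across all matrices until a global budget is met. It is a minimal sketch under stated assumptions, not the released implementation: the function names (`init_svd_update`, `delta_w`, `prune_to_budget`) are hypothetical, and the importance scores are simply passed in by the caller, since the abstract does not specify how they are computed.

```python
import numpy as np


def init_svd_update(d_out, d_in, rank, rng):
    """Hypothetical helper: build (P, lam, Q) with delta_W = P @ diag(lam) @ Q."""
    P = rng.normal(scale=0.02, size=(d_out, rank))
    lam = np.zeros(rank)  # zero singular values, so delta_W starts at zero
    Q = rng.normal(scale=0.02, size=(rank, d_in))
    return P, lam, Q


def delta_w(P, lam, Q):
    """Form the incremental update without ever running an exact SVD."""
    return (P * lam) @ Q  # scaling the columns of P equals P @ diag(lam)


def prune_to_budget(lams, scores, budget):
    """Keep the `budget` highest-scoring singular values overall, zero the rest.

    `lams` and `scores` are lists of 1-D arrays, one pair per weight matrix.
    The scores are an assumption of this sketch: any importance estimate
    supplied by the caller.
    """
    flat_scores = np.concatenate(scores)
    keep = np.zeros(flat_scores.size, dtype=bool)
    if budget > 0:
        keep[np.argsort(flat_scores)[-budget:]] = True
    # Split the global keep-mask back into one mask per weight matrix.
    boundaries = np.cumsum([s.size for s in scores])[:-1]
    masks = np.split(keep, boundaries)
    return [np.where(m, lam, 0.0) for lam, m in zip(lams, masks)]


if __name__ == "__main__":
    lams = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
    scores = [np.array([0.1, 0.9, 0.5]), np.array([0.8, 0.2])]
    pruned = prune_to_budget(lams, scores, budget=3)
    # Effective rank left to each matrix after pruning.
    print([int(np.count_nonzero(p)) for p in pruned])
```

Because pruning acts on a single global ranking, a matrix whose singular values score highly retains more rank than one whose values score poorly, which is the sense in which the budget is allocated adaptively rather than evenly.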