It is common practice in natural language processing to pre-train a single model on a general domain and then fine-tune it for downstream tasks. For Large Language Models, however, fine-tuning the entire model is computationally expensive and highly energy-intensive. As a result, several Parameter-Efficient Fine-Tuning (PEFT) approaches have recently been proposed. One of the most popular is low-rank adaptation (LoRA), whose key insight is to decompose the update weights of the pre-trained model into two low-rank matrices. However, existing approaches either use the same rank value across all weight matrices, which has been shown to be a sub-optimal choice, or do not apply any quantization technique, one of the most important factors for a model's energy consumption. In this work, we propose Bayesian-LoRA (B-LoRA), which approaches low-rank adaptation and quantization from a Bayesian perspective by placing prior distributions on both the quantization levels and the rank values. As a result, B-LoRA is able to fine-tune a pre-trained model on a specific downstream task while finding the optimal rank value and quantization level for every low-rank matrix. We validate the proposed method by fine-tuning a pre-trained DeBERTaV3 on the GLUE benchmark. Moreover, we compare it to relevant baselines and present both qualitative and quantitative results, showing that the proposed approach learns optimal-rank quantized matrices. B-LoRA performs on par with or better than the baselines while reducing the total number of bit operations by roughly 70% compared to the baseline methods.