Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. Compared to full fine-tuning, these methods achieve a reduction of over 99\% in the number of trainable parameters, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune with LoRA or its variants. We argue that not all layers contribute equally to model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to, and readily compatible with, existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50\%, while maintaining predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible drop in predictive performance on the GLUE benchmark. On decoder-only architectures, we observe only a small drop, and in some cases improvements, in predictive performance on mathematical problem-solving and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe results competitive with fine-tuning using LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
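The repository name suggests that Centered Kernel Alignment (CKA) is the metric used to measure how much each layer's internal representations change. The following is a minimal sketch of that idea under assumptions: linear CKA scores the similarity between each layer's activations before and after a brief adaptation pass, and the layers with the lowest similarity (largest representational change) are the ones selected for LoRA fine-tuning. The function names, the top-k selection rule, and the use of linear (rather than kernel) CKA are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices of shape
    (num_samples, num_features). Returns a value in [0, 1]; 1 means the
    representations are identical up to an orthogonal transform and scaling."""
    # Center each feature (column) before comparing
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

def select_layers(base_acts, adapted_acts, k):
    """Pick the k layers whose representations changed the most, i.e. the
    layers with the LOWEST CKA between base and adapted activations.
    Both arguments are lists of (num_samples, num_features) arrays,
    one per layer."""
    scores = [linear_cka(b, a) for b, a in zip(base_acts, adapted_acts)]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]
```

In a real pipeline, the activation matrices would come from forward hooks on the frozen base model and on a briefly adapted copy, evaluated on a small calibration set; LoRA modules would then be attached only to the selected layers (e.g. via the `layers_to_transform` option in a LoRA configuration).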