This paper introduces a novel Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning with pre-trained language models. PEFT techniques such as LoRA, BitFit and IA3 have demonstrated performance comparable to full fine-tuning of pre-trained models on specific downstream tasks, while requiring significantly fewer trainable parameters and less GPU memory. In multi-modal fine-tuning, however, architectural modifications or full fine-tuning are often still needed. To address this, we propose Context-PEFT, which learns different groups of adaptor parameters based on each token's domain. This approach enables LoRA-like weight injection without requiring additional architectural changes. We evaluate our method on the COCO captioning task, where it outperforms full fine-tuning under similar data constraints while being substantially more parameter-efficient and computationally economical.
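To make the idea of domain-specific adaptor groups concrete, the following is a minimal sketch of a linear layer that keeps its pre-trained weight frozen and adds a separate LoRA-style low-rank update for each token domain. It is an illustration under stated assumptions, not the paper's implementation: the class name `ContextLoRALinear`, the `"image"` and `"text"` context labels, the rank, the scaling factor `alpha`, and the initialisation are all hypothetical, and the abstract does not say which layers are adapted.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ContextLoRALinear:
    """Frozen linear layer plus one low-rank update per token context.

    Hypothetical sketch: names, shapes and initialisation are assumptions,
    not taken from the paper.
    """

    def __init__(self, d_in, d_out, rank, contexts, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for the frozen pre-trained weight, shape (d_out, d_in).
        self.weight = [[rng.gauss(0, 0.02) for _ in range(d_in)]
                       for _ in range(d_out)]
        self.scale = alpha / rank
        # One (A, B) pair per context. B starts at zero, so every adaptor
        # is initially a no-op, as in standard LoRA initialisation.
        self.adaptors = {
            ctx: {
                "A": [[rng.gauss(0, 0.02) for _ in range(d_in)]
                      for _ in range(rank)],          # shape (rank, d_in)
                "B": [[0.0] * rank for _ in range(d_out)],  # shape (d_out, rank)
            }
            for ctx in contexts
        }

    def forward(self, tokens, token_contexts):
        """Apply W x + scale * B_c (A_c x), with c chosen per token."""
        outputs = []
        for x, ctx in zip(tokens, token_contexts):
            base = matvec(self.weight, x)
            adaptor = self.adaptors[ctx]
            delta = matvec(adaptor["B"], matvec(adaptor["A"], x))
            outputs.append([b + self.scale * d for b, d in zip(base, delta)])
        return outputs


# Usage: a mixed sequence with one image token and one text token.
layer = ContextLoRALinear(d_in=4, d_out=3, rank=2, contexts=("image", "text"))
tokens = [[1.0, 0.0, 0.5, -1.0], [0.2, 0.3, 0.1, 0.0]]
outputs = layer.forward(tokens, ["image", "text"])
```

In this sketch only the per-context `A` and `B` matrices would be trained. The layer's input and output shapes are unchanged, which is one way to read the claim that no additional architectural changes are required.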