Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA and Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, such as vision-language (VL) tasks, require not only adaptation to new data but also learning the relationships between modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, that enhance VL alignment in the low-rank bottleneck. These routing functions use only linear operations and introduce no new trainable parameters. We conduct in-depth analyses to study their behavior. Across various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20\% improvement on VQAv2 ($\text{RoBERTa}_{\text{large}}$+ViT-L/16) and 30\% on COCO Captioning (GPT2-medium+ViT-L/16). When fine-tuning a pre-trained multimodal model such as CLIP-BART, we also observe smaller but consistent improvements across a range of VL PEFT tasks. Our code is available at https://github.com/tingyu215/Routing_VLPEFT.
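The abstract does not spell out the routing operations themselves. As a rough, non-authoritative illustration of the idea (the element-wise-product choice, the shapes, and all names here are assumptions for the sketch, not the paper's exact formulation), a routing function can be thought of as a parameter-free linear mixing of the two modalities applied inside a LoRA bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 16, 4  # hidden size and low-rank bottleneck dimension

# LoRA's trainable projections: A down to rank r, B back up to d_model.
A = rng.normal(0.0, 0.02, size=(d_model, r))  # down-projection
B = np.zeros((r, d_model))                    # up-projection (zero-init, as in LoRA)

def lora_with_routing(x, z):
    """LoRA update with a hypothetical routing function in the bottleneck.

    x: text hidden states, shape (seq_len, d_model)
    z: visual features already reduced to the bottleneck, shape (seq_len, r)

    The routing step used here is an element-wise product: for fixed z it is
    a linear map on the bottleneck features and adds no trainable parameters.
    The frozen backbone layer is stood in for by an identity map.
    """
    h = x @ A        # project text into the low-rank bottleneck: (seq_len, r)
    h = h * z        # routing: fuse vision into the bottleneck, parameter-free
    return x + h @ B # frozen path (identity stand-in) + low-rank VL update

x = rng.normal(size=(5, d_model))  # toy text states
z = rng.normal(size=(5, r))        # toy visual bottleneck features
out = lora_with_routing(x, z)
```

Because `B` is zero-initialized, the routed branch contributes nothing at the start of training, matching standard LoRA behavior; only as `A` and `B` are updated does the vision-conditioned bottleneck begin to shape the output.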