Referring expression comprehension (REC) is a vision-language task that aims to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but is increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with far fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between the pre-trained uni-modal encoders, and (2) high GPU memory usage caused by gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST (Multi-Modal Interactive Side-Tuning), which employs M$^3$ISAs (a Mixture of Multi-Modal Interactive Side-Adapters). During fine-tuning, we keep the pre-trained uni-modal encoders fixed and update only the M$^3$ISAs on side networks, which progressively connect the two encoders, enabling more comprehensive vision-language alignment and efficient tuning for REC. Empirical results show that M$^2$IST achieves the best balance between performance and efficiency compared to full fine-tuning and other PETL methods. With M$^2$IST, standard transformer-based REC methods achieve competitive or even superior performance relative to full fine-tuning, while requiring only 2.11\% of the tunable parameters, 39.61\% of the GPU memory, and 63.46\% of the fine-tuning time of full fine-tuning.
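The side-tuning idea described above — frozen pre-trained uni-modal encoders bridged by small trainable cross-modal adapters, so that gradients never flow through the heavy backbones — can be sketched minimally as follows. All dimensions, weight names, and the down-project/fuse/up-project rule here are illustrative assumptions for exposition, not the paper's actual M$^3$ISA design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (assumptions, not the paper's values).
D_V, D_L, D_S = 768, 512, 64  # vision dim, language dim, side-adapter bottleneck

# Stand-ins for frozen pre-trained uni-modal encoders: fixed random projections.
W_vision = rng.standard_normal((D_V, D_V))    # frozen, never updated
W_language = rng.standard_normal((D_L, D_L))  # frozen, never updated

# Trainable side-adapter weights: down-project each modality into a shared
# low-dimensional space, fuse, then project — the only parameters updated.
A_v_down = 0.01 * rng.standard_normal((D_V, D_S))
A_l_down = 0.01 * rng.standard_normal((D_L, D_S))
A_up = 0.01 * rng.standard_normal((D_S, D_S))

def forward(img_feat, txt_feat):
    # Backbone passes: in a real setup these run without gradient tracking,
    # which is what cuts GPU memory during fine-tuning.
    v = np.tanh(img_feat @ W_vision)
    l = np.tanh(txt_feat @ W_language)
    # Side path: multi-modal interaction happens in the small adapter space.
    fused = np.tanh(v @ A_v_down + l @ A_l_down)
    return fused @ A_up

frozen = W_vision.size + W_language.size
tunable = A_v_down.size + A_l_down.size + A_up.size
print(f"tunable fraction: {tunable / (tunable + frozen):.2%}")
```

Because only the adapter weights require gradients, the optimizer state and activation gradients for the two large encoders are never materialized, which is the source of both the parameter and memory savings the abstract reports.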