Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), its improvements remain smaller than those achieved by fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that improves multi-modal fine-tuning by fully leveraging the ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms both the traditional fine-tuning strategy and the vanilla in-context tuning (ICT) method, which directly takes the concatenation of all information from different modalities as input.