In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
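The abstract describes the training signal only at a high level: the retriever learns, through contrastive training on LMM feedback, to separate beneficial from detrimental in-context examples. The following is a minimal sketch of one way such an objective could look, not the authors' implementation. It assumes a multi-positive InfoNCE-style loss over cosine similarities, a temperature of 0.1, and a boolean `helpful` flag per candidate that stands in for the LMM feedback; the function names `feedback_contrastive_loss` and `retrieve` and the toy 2-D embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feedback_contrastive_loss(query, candidates, helpful, temperature=0.1):
    """Multi-positive InfoNCE-style loss (an assumed form, not from the paper).

    helpful[i] is True when candidate i improved the LMM's prediction for
    this query (hypothetical feedback signal). The loss is small when the
    helpful candidates hold most of the softmax mass over similarities.
    """
    if not any(helpful):
        raise ValueError("need at least one helpful candidate")
    logits = [cosine(query, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    positive_mass = sum(e for e, h in zip(exps, helpful) if h)
    return -math.log(positive_mass / sum(exps))


def retrieve(query, candidates, k):
    """Return indices of the k candidates most similar to the query."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings: candidates 0 and 2 look like the query, candidate 1 does not.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.8, -0.2]]

# Case A: the visually closest candidate is also the helpful one.
loss_aligned = feedback_contrastive_loss(query, candidates, [True, False, False])
# Case B: only the dissimilar candidate helped, so similarity ranks it last.
loss_misaligned = feedback_contrastive_loss(query, candidates, [False, True, False])

print(loss_aligned, loss_misaligned)
print(retrieve(query, candidates, k=2))
```

In this toy setup the loss is larger in case B, where feedback disagrees with similarity. Minimizing it would push a learnable encoder to move the helpful example closer to the query, which is the sense in which retrieval is refined beyond pure similarity. The encoder and its gradient updates are omitted here.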