Recently, large language models (LLMs) have demonstrated impressive capabilities on new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies such as fixing the demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLMs. To address this issue, we propose a novel framework, \underline{d}emonstration \underline{r}etriever for large m\underline{u}lti-modal \underline{m}odel (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given, and propose concatenating the image and text embeddings to enhance retrieval performance. Second, we propose to re-rank the demonstrations retrieved by the embedding model via the LVLM's feedback, and to compute a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework is proven effective in boosting the LVLM's in-context learning performance by retrieving more suitable demonstrations.
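The concatenation-based retrieval strategy mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes image and text embeddings are precomputed by some visual-language embedding model, joins them into a single vector per example, and ranks the demonstration pool by cosine similarity to the query; the function name and dimensions are hypothetical.

```python
import numpy as np

def retrieve_demonstrations(query_img, query_txt, pool_img, pool_txt, k=2):
    """Rank a demonstration pool by cosine similarity of concatenated
    (image, text) embeddings and return the indices of the top-k matches.

    query_img, query_txt: 1-D embedding vectors for the query sample.
    pool_img, pool_txt:   2-D arrays, one row per candidate demonstration.
    """
    # Joint embedding: simple concatenation of the two modalities.
    q = np.concatenate([query_img, query_txt])
    pool = np.concatenate([pool_img, pool_txt], axis=1)
    # Cosine similarity between the query and every candidate.
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-8)
    # Indices of the k most similar demonstrations, best first.
    return np.argsort(-sims)[:k]
```

In DRUM, a ranking produced this way is only the starting point: the retrieved candidates are subsequently re-ranked using the LVLM's feedback, which supervises fine-tuning of the embedding model itself.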