The growing parameter scale of multimodal large language models (MLLMs) brings significant capabilities, notably in-context learning, in which MLLMs improve task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process currently biased toward visual data and overlooking textual information. Moreover, supervised retrievers for MLLMs, crucial for optimal in-context example selection, remain largely unexplored. Our study offers an in-depth evaluation of how textual information affects the unsupervised selection of in-context examples in multimodal settings, uncovering a notable sensitivity of retriever performance to the modalities employed. In response, we introduce MSIER, a novel supervised MLLM retriever that uses a neural network to select examples that enhance the efficiency of multimodal in-context learning. We validate this approach through extensive testing on three distinct tasks, demonstrating its effectiveness. We further investigate how modalities influence the training of our supervised retrieval method and pinpoint the factors behind the model's success. This exploration paves the way for future advances, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.