Inspired by the success of Large Language Models in handling new tasks via In-Context Learning (ICL) in NLP, researchers have developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when applying ICL with these LVLMs, researchers usually configure the in-context sequence with the simplest strategies, such as random sampling, which leads to sub-optimal results. To improve ICL performance, we use Visual Question Answering (VQA) as a case study and explore diverse in-context configurations to identify the effective ones. Additionally, by observing how the LVLM outputs change as the in-context sequence is altered, we gain insights into the inner properties of LVLMs and improve our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through extensive experiments on three VQA datasets (VQAv2, VizWiz, and OK-VQA), we uncover three important inner properties of the applied LVLM and demonstrate which strategies consistently improve ICL VQA performance. Our code is available at https://github.com/GaryJiajia/OFv2_ICL_VQA.
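To make the contrast between random sampling and retrieval-based configuration concrete, the following is a minimal sketch and not the implementation from the repository above. It assumes each candidate demonstration carries a precomputed embedding (a stand-in for an image or question feature); the names `random_demos`, `retrieve_demos`, and `build_prompt`, the toy vectors, and the text-only prompt layout are all hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def random_demos(pool, k, seed=0):
    """Baseline: pick k demonstrations uniformly at random."""
    return random.Random(seed).sample(pool, k)


def retrieve_demos(query_vec, pool, k):
    """Pick the k demonstrations whose embeddings are most similar to the query."""
    ranked = sorted(pool, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(demos, question):
    """Concatenate demonstrations and the query into one in-context sequence.

    Hypothetical text-only layout; a real LVLM prompt would also interleave images.
    """
    lines = [f"Question: {d['question']} Answer: {d['answer']}" for d in demos]
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)


# Toy pool with made-up embeddings (assumption: real ones come from an encoder).
pool = [
    {"question": "What color is the bus?", "answer": "red", "vec": [1.0, 0.0, 0.0]},
    {"question": "How many dogs are there?", "answer": "2", "vec": [0.0, 1.0, 0.0]},
    {"question": "What color is the car?", "answer": "blue", "vec": [0.9, 0.1, 0.0]},
    {"question": "Is it raining?", "answer": "no", "vec": [0.0, 0.0, 1.0]},
]

query_question = "What color is the truck?"
query_vec = [0.8, 0.2, 0.0]

retrieved = retrieve_demos(query_vec, pool, k=2)
sampled = random_demos(pool, k=2)
prompt = build_prompt(retrieved, query_question)
print(prompt)
```

In this toy setup the retrieved demonstrations are the two color questions, whereas the random baseline may return unrelated ones; the resulting sequences can then be fed to the model to compare its outputs.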