After the discovery that Language Models (LMs) can be strong in-context few-shot learners, numerous strategies have been proposed to optimize the configuration of in-context sequences. Recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they configure the in-context image-text pairs only in the simplest way, i.e., by random sampling. To explore the effects of varying configurations on VL in-context learning, we devise four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting characteristics of VL in-context learning that, owing to multi-modal synergy, differ from the NLP case. Furthermore, in exploring optimal combinations of strategies, we observe an average improvement of 20.9 CIDEr points over the baseline. The code is available at https://github.com/yongliang-wu/ExploreCfg.
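As a rough illustration of the random-sampling baseline mentioned in the abstract, the sketch below builds an in-context sequence of image-text pairs followed by the query image. The names `Example`, `random_in_context_sequence`, `pool`, and `n_shots` are hypothetical and are not taken from the released code, and image identifiers stand in for actual images. It does not reproduce any of the four image-selection or four caption-assignment strategies, which the abstract does not specify.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Example:
    """One candidate in-context pair (hypothetical structure)."""
    image_id: str  # placeholder standing in for an actual image
    caption: str


def random_in_context_sequence(
    pool: List[Example],
    query_image_id: str,
    n_shots: int,
    seed: Optional[int] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Randomly sample n_shots image-text pairs, then append the query image.

    The query image is paired with None because its caption is what the
    model is expected to generate.
    """
    rng = random.Random(seed)
    # Sampling without replacement; raises ValueError if n_shots > len(pool).
    shots = rng.sample(pool, n_shots)
    sequence: List[Tuple[str, Optional[str]]] = [
        (ex.image_id, ex.caption) for ex in shots
    ]
    sequence.append((query_image_id, None))
    return sequence


# Toy pool of ten candidate pairs (made-up identifiers and captions).
pool = [Example(f"img_{i}", f"caption {i}") for i in range(10)]
seq = random_in_context_sequence(pool, "query_img", n_shots=4, seed=0)
```

A non-random configuration would replace the `rng.sample` call with a selection rule over `pool` and, separately, a rule for deciding which caption accompanies each selected image.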