Since the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains have also developed few-shot learners, but they use only the simplest approach, i.e., random sampling, to configure in-context image-text pairs. To explore the effects of varying configurations on VL in-context learning, we devise four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be seen as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning that arise from multi-modal synergy, as compared with the NLP case. Furthermore, in our exploration of optimal combination strategies, we observe an average improvement of 20.9 CIDEr points over the baseline. The code is available at https://github.com/yongliang-wu/ExploreCfg.