After the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they use only the simplest approach, i.e., random sampling, to configure in-context image-text pairs. To explore the effects of varying configurations on VL in-context learning, we devise four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting characteristics of VL in-context learning that, owing to multi-modal synergy, are distinct from the NLP case. Furthermore, in exploring optimal combination strategies, we observe an average improvement of 20.7 CIDEr points over the baseline. The code is available at https://github.com/yongliang-wu/ExploreCfg.