Since the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. More recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they use only the simplest approach, random sampling, to configure in-context image-text pairs. To explore how different configurations affect VL in-context learning, we devise four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. We use image captioning as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting characteristics of VL in-context learning that arise from multi-modal synergy and that distinguish it from the NLP case.