Large-scale language models have shown the ability to adapt to a new task by conditioning on a few demonstrations (i.e., in-context learning). In the vision-language domain, however, most large-scale pre-trained vision-language (VL) models cannot perform in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can the in-context learning ability be transferred from the language domain to the VL domain? Specifically, we first meta-train a language model to perform in-context learning on NLP tasks (as in MetaICL); we then transfer this model to VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves in-context learning capability on VL tasks and can even compensate substantially for a smaller model size. On VQA, OK-VQA, and GQA, our method can outperform the baseline model while having 20 times fewer parameters.