We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text inputs and to generate free-form text interleaved with retrieved images. Our method leverages abilities that language models acquire from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
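The following toy sketch illustrates the idea of a frozen language model wrapped by trainable input and output linear layers. It is not the actual implementation: the dimensions, the averaging stand-in for the language model, the names `input_proj` and `output_proj`, and the use of cosine similarity for retrieval are all assumptions made for illustration.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free linear layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VISUAL_DIM, LM_DIM = 4, 6  # toy sizes, chosen arbitrarily

# Each parameter group records whether training may update it.
# Only the two linear layers are trainable; the LM stays frozen.
params = {
    "lm": {"weights": random_matrix(LM_DIM, LM_DIM, rng), "trainable": False},
    "input_proj": {"weights": random_matrix(LM_DIM, VISUAL_DIM, rng), "trainable": True},
    "output_proj": {"weights": random_matrix(VISUAL_DIM, LM_DIM, rng), "trainable": True},
}


def trainable_names(params):
    """Names of the parameter groups an optimizer would be given."""
    return sorted(name for name, p in params.items() if p["trainable"])


def encode_sequence(sequence, params):
    """Encode an interleaved list of ("text", vec) and ("image", vec) items.

    Image features are mapped into the LM input space by the input layer.
    The frozen LM is replaced here by a hypothetical stand-in: average the
    inputs, then apply one frozen matrix.
    """
    embedded = []
    for kind, vec in sequence:
        if kind == "image":
            vec = linear(params["input_proj"]["weights"], vec)
        embedded.append(vec)
    pooled = [sum(column) / len(embedded) for column in zip(*embedded)]
    return linear(params["lm"]["weights"], pooled)


def retrieve(hidden, candidates, params):
    """Return the index of the candidate image closest to the projected state."""
    query = linear(params["output_proj"]["weights"], hidden)
    scores = [cosine(query, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Usage: one text token, one image, one text token, then retrieval.
sequence = [
    ("text", [rng.uniform(-1, 1) for _ in range(LM_DIM)]),
    ("image", [rng.uniform(-1, 1) for _ in range(VISUAL_DIM)]),
    ("text", [rng.uniform(-1, 1) for _ in range(LM_DIM)]),
]
candidates = [[rng.uniform(-1, 1) for _ in range(VISUAL_DIM)] for _ in range(5)]
hidden = encode_sequence(sequence, params)
print("trainable:", trainable_names(params))
print("retrieved image index:", retrieve(hidden, candidates, params))
```

In this sketch, only `input_proj` and `output_proj` would be passed to an optimizer, which is what keeps the trainable parameter count small relative to the language model.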