We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages abilities that language models learn from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs and to generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
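To make the parameter split concrete, the following is a minimal toy sketch in pure Python, not the actual implementation. It assumes a stand-in "language model" (`frozen_lm`, a fixed mean-pool followed by a fixed linear map), two trainable linear maps (`input_proj`, `output_proj`), arbitrary toy dimensions, and cosine similarity for retrieval. All of these names and choices are hypothetical and serve only to illustrate which weights stay frozen and which would be finetuned.

```python
import math
import random


def linear(weights, vec):
    """Apply a bias-free linear map; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VISUAL_DIM, LM_DIM = 6, 4  # toy sizes, chosen arbitrarily (assumption)

# Frozen: stand-in for the pretrained language model. Never updated.
frozen = {
    "token_emb": {w: [rng.uniform(-1, 1) for _ in range(LM_DIM)]
                  for w in ["a", "photo", "of", "dog"]},
    "mixer": random_matrix(LM_DIM, LM_DIM, rng),
}

# Trainable: the only weights a training loop would update.
trainable = {
    # image features -> language model input embedding space
    "input_proj": random_matrix(LM_DIM, VISUAL_DIM, rng),
    # language model hidden state -> image feature space (for retrieval)
    "output_proj": random_matrix(VISUAL_DIM, LM_DIM, rng),
}


def embed_interleaved(sequence):
    """Map a mixed list of str tokens and image feature vectors to LM inputs."""
    inputs = []
    for item in sequence:
        if isinstance(item, str):
            inputs.append(frozen["token_emb"][item])                # frozen lookup
        else:
            inputs.append(linear(trainable["input_proj"], item))   # trainable map
    return inputs


def frozen_lm(embeddings):
    """Toy stand-in for a frozen LM: mean-pool, then a fixed nonlinear map."""
    pooled = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    return [math.tanh(v) for v in linear(frozen["mixer"], pooled)]


def retrieve(sequence, candidates):
    """Return the index of the candidate image closest to the projected state."""
    hidden = frozen_lm(embed_interleaved(sequence))
    query = linear(trainable["output_proj"], hidden)
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Usage: an interleaved image-and-text input, then retrieval over three images.
images = [[rng.uniform(-1, 1) for _ in range(VISUAL_DIM)] for _ in range(3)]
sequence = ["a", "photo", images[0], "of", "a", "dog"]
best = retrieve(sequence, images)
```

With untrained random weights the retrieved index is meaningless; the point of the sketch is only that gradients would flow into `trainable` while everything in `frozen` stays fixed.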