Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring ever larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (improvements of 12 FID and 17 CIDEr on MS-COCO), while requiring much less training compute (<30% of DALL-E's). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
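The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It is not the RA-CM3 implementation: the hand-written 3-dimensional vectors stand in for CLIP embeddings, and `MemoryDoc`, `retrieve`, `build_generator_input`, the `<img:...>` placeholder, and the `[SEP]` separator are all hypothetical names chosen for illustration. Prepending the retrieved documents to the prompt is likewise an assumption about how a generator might "refer to" them.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryDoc:
    """One entry of the external memory: a caption plus an image reference."""
    caption: str
    image_ref: str          # placeholder for real image data (hypothetical)
    embedding: List[float]  # toy stand-in for a CLIP-style embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb: List[float], memory: List[MemoryDoc], k: int = 2) -> List[MemoryDoc]:
    """Return the k memory documents most similar to the query embedding."""
    ranked = sorted(memory, key=lambda d: cosine(query_emb, d.embedding), reverse=True)
    return ranked[:k]


def build_generator_input(prompt: str, retrieved: List[MemoryDoc]) -> str:
    """Place retrieved documents before the prompt (an assumed input format)."""
    context = [f"<img:{d.image_ref}> {d.caption}" for d in retrieved]
    return " [SEP] ".join(context + [prompt])


# Toy external memory with made-up embeddings.
memory = [
    MemoryDoc("The Eiffel Tower at night", "eiffel.jpg", [0.9, 0.1, 0.0]),
    MemoryDoc("A bowl of ramen", "ramen.jpg", [0.0, 0.2, 0.9]),
    MemoryDoc("Paris skyline at dusk", "paris.jpg", [0.8, 0.3, 0.1]),
]

query_emb = [1.0, 0.2, 0.0]  # pretend embedding of the prompt below
top = retrieve(query_emb, memory, k=2)
generator_input = build_generator_input("A photo of the Eiffel Tower", top)
print(generator_input)
```

In this sketch the knowledge about the Eiffel Tower lives in the memory rather than in model parameters, so extending what the system can draw on only requires adding entries to `memory`, which is the modularity argument the abstract makes.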