Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has achieved state-of-the-art results in image-to-text generation. However, these models store all of their knowledge in their parameters, and therefore often require an enormous number of parameters to model the abundance of visual concepts and rich textual descriptions. They are also inefficient at incorporating new data, which requires a computationally expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo, that supports retrieving relevant knowledge from an external database for zero-shot and in-context few-shot image-to-text generation. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image-text dataset that facilitates in-context few-shot learning. We demonstrate that Re-ViLM significantly boosts performance on image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings, with 4 times fewer parameters than baseline methods.
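The idea of keeping knowledge in an external database that can be updated without retraining can be illustrated with a minimal sketch. This is not the Re-ViLM retriever: the class `CaptionDatabase`, its methods, and the three-dimensional hand-written embeddings are hypothetical stand-ins, and cosine-similarity nearest-neighbor search is assumed here purely for illustration. The abstract does not specify how retrieval is performed or how retrieved text is fed into the model.

```python
import numpy as np


class CaptionDatabase:
    """Toy external memory of (image embedding, caption) pairs (hypothetical)."""

    def __init__(self, dim):
        self.dim = dim
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.captions = []

    def add(self, embedding, caption):
        # Store a unit-normalized embedding so a dot product equals cosine similarity.
        vec = np.asarray(embedding, dtype=np.float32).reshape(1, self.dim)
        vec = vec / np.linalg.norm(vec)
        self.embeddings = np.vstack([self.embeddings, vec])
        self.captions.append(caption)

    def retrieve(self, query, k=2):
        # Return the captions of the k entries most similar to the query embedding.
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.embeddings @ q
        top = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top]


db = CaptionDatabase(dim=3)
db.add([1.0, 0.0, 0.0], "a dog running on grass")
db.add([0.0, 1.0, 0.0], "a plate of pasta")

# Assumed embedding of a query image from a domain the database does not cover yet.
query = [0.2, 0.1, 1.0]
before = db.retrieve(query, k=1)

# Accommodating new data only requires adding an entry; no parameters are updated.
db.add([0.0, 0.0, 1.0], "a red double-decker bus")
after = db.retrieve(query, k=1)

print(before, after)
```

In this sketch the retrieved caption changes as soon as the new entry is added, which mirrors the claim that new data can be accommodated at evaluation time by updating the database alone.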