Recent advances in generative language models have shown that they can memorize knowledge from documents and recall it to answer user queries effectively. Building on this capability, we propose enabling multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is expected to "recall" the relevant image from its parameters as the response. Achieving this goal poses notable challenges, including how to build visual memory into MLLMs and how to design a visual recall scheme for them. To address these challenges, we introduce a generative cross-modal retrieval framework that assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step trains the MLLM to memorize the association between images and their respective identifiers. The second step teaches the MLLM to generate the identifier of the target image given a textual query. By memorizing images in MLLMs, we introduce a new paradigm for cross-modal retrieval, distinct from previous discriminative approaches. Experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
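To make the two training steps concrete, the following is a minimal sketch of how the training pairs for each step might be assembled. It covers data construction only, not the MLLM itself. All names here (`Example`, `assign_identifiers`, `build_memorize_examples`, `build_retrieve_examples`, `resolve`) and the toy image paths and queries are hypothetical, and using the image index as the identifier string is an assumption made for illustration; the framework only requires that each identifier be a unique string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One training pair: a model input and the identifier string to generate."""
    source: str  # image path (memorize step) or text query (retrieve step)
    target: str  # identifier string of the associated image
    step: str    # "memorize" or "retrieve"


def assign_identifiers(image_paths):
    """Give every image a unique identifier string.

    Assumption for illustration: the identifier is the image's index as text.
    """
    return {path: str(index) for index, path in enumerate(image_paths)}


def build_memorize_examples(identifiers):
    """Step 1 (learning to memorize): image -> its identifier."""
    return [Example(path, ident, "memorize") for path, ident in identifiers.items()]


def build_retrieve_examples(query_image_pairs, identifiers):
    """Step 2 (learning to retrieve): text query -> identifier of the target image."""
    return [
        Example(query, identifiers[path], "retrieve")
        for query, path in query_image_pairs
    ]


def resolve(identifier, identifiers):
    """Map a generated identifier back to its image, or None if it is unknown."""
    reverse = {ident: path for path, ident in identifiers.items()}
    return reverse.get(identifier)


# Toy data, purely illustrative.
images = ["cat.jpg", "dog.jpg", "beach.jpg"]
queries = [
    ("a cat sleeping on a sofa", "cat.jpg"),
    ("waves on a sandy beach", "beach.jpg"),
]

identifiers = assign_identifiers(images)
memorize_set = build_memorize_examples(identifiers)
retrieve_set = build_retrieve_examples(queries, identifiers)
```

At inference time, retrieval under this sketch reduces to generating an identifier string from the query and passing it to `resolve`, so no per-candidate similarity scoring against the image set is needed.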