The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to unlimited visual concepts without additional fine-tuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://hoar012.github.io/RAP-Project/.
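To make the Remember/Retrieve/Generate loop concrete, the sketch below shows one plausible way such a pipeline could be organized. It is not the paper's implementation: every name in it (ConceptRecord, ConceptDatabase, mllm_generate, the random embeddings) is a hypothetical placeholder, and the actual RAP system uses a trained multimodal retriever and a real MLLM rather than stubs.

```python
# Minimal sketch of a Remember / Retrieve / Generate pipeline, assuming a
# key-value concept store and cosine-similarity retrieval. All names here
# are illustrative placeholders, not the RAP codebase.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ConceptRecord:
    """Value stored in the user database: name, embedding, and attributes."""
    name: str
    embedding: np.ndarray          # e.g., an avatar/crop embedding used as the key
    attributes: dict = field(default_factory=dict)


class ConceptDatabase:
    """Key-value store of user concepts; editable at any time (remember/forget)."""

    def __init__(self):
        self.records: list[ConceptRecord] = []

    def remember(self, record: ConceptRecord) -> None:
        self.records.append(record)

    def forget(self, name: str) -> None:
        self.records = [r for r in self.records if r.name != name]

    def retrieve(self, query_emb: np.ndarray, top_k: int = 2) -> list[ConceptRecord]:
        """Return the top-k concepts by cosine similarity to the query embedding."""
        def cos(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        ranked = sorted(self.records, key=lambda r: cos(query_emb, r.embedding), reverse=True)
        return ranked[:top_k]


def mllm_generate(prompt: str) -> str:
    # Stand-in for a real multimodal LLM call; here it just echoes the prompt.
    return f"[MLLM response conditioned on]\n{prompt}"


def personalized_answer(query: str, query_emb: np.ndarray, db: ConceptDatabase) -> str:
    """Generate: prepend retrieved concept information to the query before the MLLM call."""
    retrieved = db.retrieve(query_emb)
    context = "; ".join(f"{r.name}: {r.attributes}" for r in retrieved)
    prompt = f"Known user concepts: {context}\nQuestion: {query}"
    return mllm_generate(prompt)


if __name__ == "__main__":
    db = ConceptDatabase()
    db.remember(ConceptRecord("my dog Nugget", np.random.rand(8), {"breed": "corgi"}))
    db.remember(ConceptRecord("my mug", np.random.rand(8), {"color": "blue"}))
    print(personalized_answer("Who is in this photo?", np.random.rand(8), db))
```

Because personalization lives entirely in the external database in this sketch, adding or removing a concept (remember/forget) takes effect immediately, which mirrors the real-time concept editing described above.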