The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in users' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and information about the retrieved concepts are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to unlimited visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
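The three steps above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the `embed` function is a toy bag-of-characters stand-in for the multimodal retriever's encoder, and the concept entries are invented examples.

```python
import math

def embed(text):
    # Toy bag-of-characters embedding standing in for a multimodal encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# (a) Remember: a key-value database of user-specific concepts.
database = {
    "Bob": {"info": "Bob is the user's dog, a golden retriever."},
    "Alice": {"info": "Alice is the user's sister."},
}

# (b) Retrieve: rank stored concepts by similarity to the query.
def retrieve(query, db, top_k=1):
    q = embed(query)
    ranked = sorted(
        db.values(),
        key=lambda entry: cosine(q, embed(entry["info"])),
        reverse=True,
    )
    return [entry["info"] for entry in ranked[:top_k]]

# (c) Generate: prepend the retrieved knowledge to the query before
# passing the prompt to the MLLM (the model call itself is omitted).
def build_prompt(query, db):
    context = "\n".join(retrieve(query, db))
    return f"Known user concepts:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What breed is Bob?", database)
```

Because the database is external to the model, editing a concept is just a dictionary update, which mirrors the real-time concept editing the framework claims.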