Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept into its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability to personalized visual question-answering. Our experiments show that the approach generalizes to unseen images of learned concepts while preserving the model's behavior on unrelated inputs.
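To make the two-stage idea concrete, the following is a minimal sketch of one plausible reading of the mechanism: a concept head decides whether a concept is present, and only then is a learned concept embedding added to the features passed to the language model. Everything here is an assumption for illustration. The names `ConceptHead` and `personalize_features`, the logistic form of the head, the threshold, and the choice to append the embedding to the visual tokens are hypothetical and are not taken from this work.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass
class ConceptHead:
    """Hypothetical binary classifier acting as an on/off toggle for one concept."""
    name: str
    weights: Vector
    bias: float
    threshold: float = 0.5  # assumed decision threshold

    def score(self, image_feature: Vector) -> float:
        # Logistic score over a pooled image feature (assumed form).
        logit = sum(w * x for w, x in zip(self.weights, image_feature)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))

    def is_present(self, image_feature: Vector) -> bool:
        return self.score(image_feature) >= self.threshold


def personalize_features(
    visual_tokens: List[Vector],
    pooled_feature: Vector,
    heads: List[ConceptHead],
    concept_embeddings: Dict[str, Vector],
) -> List[Vector]:
    """Append the learned embedding of every detected concept to the visual tokens.

    If no head fires, the tokens are returned unchanged, so the underlying
    model receives exactly its usual input.
    """
    extra = [concept_embeddings[h.name] for h in heads if h.is_present(pooled_feature)]
    return list(visual_tokens) + extra


# Toy usage with made-up 2-dimensional features.
head = ConceptHead(name="my_dog", weights=[2.0, -1.0], bias=0.0)
embeddings = {"my_dog": [0.1, 0.2]}
tokens = [[1.0, 0.0], [0.0, 1.0]]

with_concept = personalize_features(tokens, [3.0, 0.0], [head], embeddings)
without_concept = personalize_features(tokens, [-3.0, 0.0], [head], embeddings)
```

In this toy setup the first call adds one extra token because the head fires, while the second returns the original tokens untouched, which mirrors the stated goal of preserving behavior on unrelated inputs.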