MyVLM: Personalizing VLMs for User-Specific Queries

Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
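The two-stage idea described above, a concept head that acts as a toggle followed by a learned concept embedding, can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation. It assumes the head is a simple linear probe over a single image feature vector, and that the concept embedding is appended to the visual tokens passed to the language model; the abstract specifies neither. The names `ConceptHead` and `personalize_tokens` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


class ConceptHead:
    """Hypothetical external head that signals whether a concept is present.

    Assumption: a linear probe with a sigmoid and a fixed threshold.
    The abstract only says the heads function as toggles.
    """

    def __init__(self, weights: Sequence[float], bias: float, threshold: float = 0.5):
        self.weights = list(weights)
        self.bias = bias
        self.threshold = threshold

    def is_present(self, image_feature: Sequence[float]) -> bool:
        score = sum(w * x for w, x in zip(self.weights, image_feature)) + self.bias
        return sigmoid(score) >= self.threshold


def personalize_tokens(
    visual_tokens: Sequence[Vector],
    image_feature: Sequence[float],
    head: ConceptHead,
    concept_embedding: Vector,
) -> List[Vector]:
    """Return the visual tokens, with the concept embedding added only
    when the head fires.

    Assumption: the embedding is appended as one extra token in the
    intermediate feature space. When the head does not fire, the tokens
    are returned unchanged, so unrelated inputs see the original model.
    """
    tokens = [list(t) for t in visual_tokens]
    if head.is_present(image_feature):
        tokens.append(list(concept_embedding))
    return tokens
```

In this sketch the toggle is what preserves behavior on unrelated inputs: an image that does not trigger the head yields exactly the original token sequence, so the language model receives the same input it would have received without personalization.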
