Current vision-language models (VLMs) show exceptional abilities across diverse tasks, including visual question answering. To enhance the user experience in practical applications, recent studies investigate VLM personalization, enabling models to understand user-provided concepts. However, existing studies mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits the real-world applicability of personalized VLMs. In this paper, we propose the first multi-concept personalization method, named MC-LLaVA, along with a high-quality multi-concept personalization dataset. Specifically, MC-LLaVA uses a joint training strategy that incorporates multiple concepts in a single training step, allowing VLMs to respond accurately in multi-concept personalization. To reduce the cost of joint training, MC-LLaVA leverages visual token information to initialize concept tokens, yielding improved concept representations and accelerating joint training. To advance multi-concept personalization research, we further contribute a high-quality dataset: we carefully collect images from various movies that feature multiple characters and manually generate multi-concept question-answer samples, covering diverse movie types and question-answer types. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
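To make the two mechanisms named above concrete, below is a minimal PyTorch sketch of (a) initializing a concept token's embedding from pooled visual-token features and (b) a joint training step over a batch that mixes samples from all concepts. It assumes a HuggingFace-style LLaVA interface (`model(input_ids=..., pixel_values=..., labels=...)` returning an output with `.loss`); the helper names `vision_tower`, `projector`, and `concept_images` are hypothetical stand-ins for illustration, not MC-LLaVA's actual API.

```python
import torch

# Sketch only: helper names below are assumptions, not MC-LLaVA's real code.

def init_concept_embeddings(vision_tower, projector, concept_images):
    """Initialize one embedding per concept from visual-token features.

    concept_images maps a concept name to a tensor of preprocessed images
    with shape (n_images, 3, H, W). Each concept's initial embedding is the
    mean of its projected patch features, so the new concept token starts
    near the concept's visual representation instead of at a random point.
    """
    init = {}
    with torch.no_grad():
        for name, images in concept_images.items():
            feats = projector(vision_tower(images))  # (n, n_patches, d_llm)
            init[name] = feats.mean(dim=(0, 1))      # (d_llm,)
    return init

def joint_training_step(model, optimizer, batch):
    """One joint step over a batch mixing QA samples from all concepts,
    so every concept's token embedding receives gradients together."""
    optimizer.zero_grad()
    out = model(input_ids=batch["input_ids"],
                pixel_values=batch["pixel_values"],
                labels=batch["labels"])
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```

The design intuition follows the abstract: a visual-feature initialization shortens how far each concept token must travel during optimization, which is what lets the joint multi-concept training converge at lower cost than training each concept separately.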