Current vision-language models (VLMs) show exceptional abilities across diverse tasks, including visual question answering. To enhance the user experience in practical applications, recent studies investigate VLM personalization, enabling models to understand user-provided concepts. However, existing studies mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits the real-world applicability of personalized VLMs. In this paper, we propose the first multi-concept personalization method, named MC-LLaVA, along with a high-quality multi-concept personalization dataset. Specifically, MC-LLaVA uses a joint training strategy that incorporates multiple concepts in a single training step, allowing VLMs to respond accurately in multi-concept personalization. To reduce the cost of joint training, MC-LLaVA leverages visual token information to initialize concept tokens, yielding improved concept representations and accelerating joint training. To advance multi-concept personalization research, we further contribute a high-quality dataset: we carefully collect images from various movies that feature multiple characters and manually generate multi-concept question-answer samples, covering diverse movie types and question-answer types. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
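To make the two mechanisms named above concrete, below is a minimal PyTorch sketch of (a) initializing a concept token's embedding from pooled visual-token features and (b) a joint training step over a batch that mixes samples from all concepts. It assumes a HuggingFace-style LLaVA interface (`model(input_ids=..., pixel_values=..., labels=...)` returning an output with `.loss`); the helper names `vision_tower`, `projector`, and `concept_images` are hypothetical stand-ins for illustration, not MC-LLaVA's actual API.

```python
import torch

# Sketch only: helper names below are assumptions, not MC-LLaVA's real code.

def init_concept_embeddings(vision_tower, projector, concept_images):
    """Initialize one embedding per concept from visual-token features.

    concept_images maps a concept name to a tensor of preprocessed images
    with shape (n_images, 3, H, W). Each concept's initial embedding is the
    mean of its projected patch features, so the new concept token starts
    near the concept's visual representation instead of at a random point.
    """
    init = {}
    with torch.no_grad():
        for name, images in concept_images.items():
            feats = projector(vision_tower(images))  # (n, n_patches, d_llm)
            init[name] = feats.mean(dim=(0, 1))      # (d_llm,)
    return init

def joint_training_step(model, optimizer, batch):
    """One joint step over a batch mixing QA samples from all concepts,
    so every concept's token embedding receives gradients together."""
    optimizer.zero_grad()
    out = model(input_ids=batch["input_ids"],
                pixel_values=batch["pixel_values"],
                labels=batch["labels"])
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```

The design intuition follows the abstract: a visual-feature initialization shortens how far each concept token must travel during optimization, which is what lets the joint multi-concept training converge at lower cost than training each concept separately.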