Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization, enabling models to understand user-provided concepts. However, these studies mainly focus on a single concept, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy that effectively integrates multiple concepts in a single training step. To reduce training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, which aggregates location maps to enhance recognition and grounding. To further raise the performance upper bound, we incorporate an optional auxiliary loss that strengthens the proposed personalized prompts. To advance VLM personalization research, we contribute a high-quality dataset. We carefully collect images containing multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring high diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at \href{https://github.com/arctanxarc/MC-LLaVA}{https://github.com/arctanxarc/MC-LLaVA}.
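As a rough illustration of the concept-token initialization described above (using visual token information to warm-start new concept embeddings), the following is a minimal sketch, not the authors' implementation; names such as \texttt{vision\_encoder}, \texttt{projector}, and \texttt{concept\_images} are hypothetical placeholders.

\begin{verbatim}
# Hedged sketch: initialize learnable concept-token embeddings from visual
# features of user-provided concept images. `vision_encoder` and `projector`
# are assumed modules mapping images into the LLM embedding space.
import torch
import torch.nn as nn

def init_concept_tokens(vision_encoder: nn.Module,
                        projector: nn.Module,
                        concept_images: torch.Tensor,
                        num_tokens: int = 1) -> nn.Parameter:
    """Derive initial embeddings for new <concept> tokens from concept images."""
    with torch.no_grad():
        # (num_images, num_patches, llm_dim) after projection
        visual_feats = projector(vision_encoder(concept_images))
        # Pool over images and patches to get one concept representation
        pooled = visual_feats.mean(dim=(0, 1))            # (llm_dim,)
    # Replicate for the desired number of learnable concept tokens
    init = pooled.unsqueeze(0).repeat(num_tokens, 1)      # (num_tokens, llm_dim)
    return nn.Parameter(init.clone())                     # trainable embeddings
\end{verbatim}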