AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision-language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and state-of-the-art methods across various personalization settings, including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
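To make the core idea concrete, the following is a minimal sketch of attention-guided concept-token extraction. It assumes the model exposes per-layer attention weights and image patch embeddings; the function name, the head/position averaging, and the top-k selection rule are illustrative simplifications, not the paper's exact procedure.

```python
import torch

def extract_concept_tokens(attn_weights, visual_tokens, k=8):
    """Select the k visual tokens most attended by the concept's text tokens.

    attn_weights:  (num_heads, num_text_tokens, num_visual_tokens) attention
                   from the concept's text tokens to the image patch tokens,
                   read from the model's own attention layers.
    visual_tokens: (num_visual_tokens, hidden_dim) patch embeddings.
    Returns a (k, hidden_dim) tensor serving as the concept "memory".
    """
    # Average attention over heads and text positions to score each patch.
    scores = attn_weights.mean(dim=(0, 1))   # (num_visual_tokens,)
    top_idx = scores.topk(k).indices         # indices of the most salient patches
    return visual_tokens[top_idx]            # cached concept-memory tokens

# Toy usage: at test time, the cached tokens would be injected into the
# prompt context so the model can recall the concept without weight updates.
if __name__ == "__main__":
    heads, n_txt, n_vis, dim = 12, 4, 256, 768
    attn = torch.rand(heads, n_txt, n_vis).softmax(dim=-1)
    patches = torch.randn(n_vis, dim)
    memory = extract_concept_tokens(attn, patches, k=8)
    print(memory.shape)  # torch.Size([8, 768])
```

Because the extraction reuses the model's own attention maps and requires no gradient steps, the per-concept cost reduces to a single forward pass plus a small token cache, which is the source of the minimal personalization overhead claimed above.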