Personalizing Large Vision-Language Models (LVLMs) means adapting them to recognize specific users or object instances and to generate responses tailored to that context. Existing approaches require time-consuming training for each personalized item, which makes them impractical for real-world deployment; this limitation is mirrored in current personalization benchmarks, which are restricted to object-centric, single-concept evaluation. In this paper, we present \ours, a novel training-free approach to LVLM personalization, together with a comprehensive real-world benchmark designed to rigorously evaluate multiple aspects of the personalization task. \ours leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. This model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos without any additional training. \ours achieves state-of-the-art results, surpassing existing training-based methods.
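To make the extract, retrieve, and prompt steps described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that region features have already been produced by some vision encoder, that stored concepts are matched by cosine similarity against a fixed threshold, and that a plain-text hint stands in for the visual prompt. The names `cosine`, `identify`, `build_prompt`, the `threshold` value, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(region_features, concept_bank, threshold=0.8):
    """Match each region to its most similar stored concept.

    Returns one entry per region: the concept name, or None when no
    stored concept reaches the (assumed) similarity threshold.
    """
    matches = []
    for feat in region_features:
        best_name, best_score = None, threshold
        for name, ref in concept_bank.items():
            score = cosine(feat, ref)
            if score >= best_score:
                best_name, best_score = name, score
        matches.append(best_name)
    return matches


def build_prompt(question, matches):
    """Prepend a textual hint naming each recognized region.

    This is a text-only stand-in for a visual prompt; unmatched
    regions are simply skipped.
    """
    hints = [
        f"Region {i} shows <{name}>."
        for i, name in enumerate(matches)
        if name is not None
    ]
    return " ".join(hints + [question])


# Toy usage: two stored concepts, three detected regions.
concept_bank = {"alice": [1.0, 0.0], "rex": [0.0, 1.0]}
regions = [[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]]
matches = identify(regions, concept_bank)
prompt = build_prompt("What is happening in this image?", matches)
```

Because nothing here is trained, adding a new concept only requires adding one reference vector to `concept_bank`, and several concepts can be matched in the same image in a single pass.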