Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization: when interpreting a new image, a VLM must visually recognize and textually retrieve the user's personalized visual experiences. To address it, we propose CoViP, a unified framework that treats personalized image captioning as the core task for contextualized visual personalization and strengthens this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs genuinely leverage visual context. Extensive experiments show that existing open-source and proprietary VLMs exhibit substantial limitations, whereas CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
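To make the caption-augmented pipeline concrete, the sketch below illustrates one plausible reading of the two-stage flow the abstract describes: a personalized caption is first generated for a new image conditioned on the user's accumulated captions, and that caption is then prepended to downstream queries. All names here (UserMemory, vlm_generate, personalized_caption, personalized_answer) are hypothetical placeholders, not CoViP's actual interface, and the stub model call stands in for an RL-post-trained VLM.

```python
# A minimal sketch of caption-augmented generation, under the assumptions
# stated above. Every identifier is a placeholder, not CoViP's real API.

from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Accumulated visual-textual context: captions of a user's past images."""
    captions: list[str] = field(default_factory=list)

    def add(self, caption: str) -> None:
        self.captions.append(caption)


def vlm_generate(prompt: str, image: bytes) -> str:
    """Placeholder for a VLM call; a real client would go here."""
    return f"<generated text for prompt: {prompt[:40]}...>"


def personalized_caption(image: bytes, memory: UserMemory) -> str:
    # Stage 1: caption the new image while conditioning on the user's
    # prior captions, so previously seen entities can be named.
    context = "\n".join(f"- {c}" for c in memory.captions)
    prompt = (
        "Past captions from this user's photo library:\n"
        f"{context}\n"
        "Caption the new image, naming any previously seen people, "
        "places, or objects."
    )
    return vlm_generate(prompt, image)


def personalized_answer(question: str, image: bytes, memory: UserMemory) -> str:
    # Stage 2: caption-augmented generation -- the personalized caption
    # is prepended to the query so downstream tasks inherit the
    # association between the image and the user's context.
    caption = personalized_caption(image, memory)
    memory.add(caption)  # the new caption also extends the user's context
    prompt = f"Image caption: {caption}\nQuestion: {question}"
    return vlm_generate(prompt, image)
```

Under this reading, the personalized caption acts as an explicit textual bridge between the user's visual history and any downstream personalization task, which is consistent with the abstract's claim that captioning gains transfer holistically.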