As wearable devices such as smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To evaluate this capability systematically, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence and for aggregating multiple visual observations remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. This baseline consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
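To make the two ideas named in the abstract more concrete (a memory bank that refines itself as observations accumulate, and evidence selection that adapts to the query), the following is a minimal toy sketch. It is not the Agentic Context Bank itself: the abstract does not describe the implementation, so the class names `Entry` and `ContextBank`, the methods `observe` and `select`, the use of text notes in place of visual observations, and the keyword-overlap scoring are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One remembered entity (person, object, or behavior) and its notes.

    Hypothetical structure: text notes stand in for visual observations.
    """
    name: str
    notes: set[str] = field(default_factory=set)


class ContextBank:
    """Toy memory bank: merges repeated observations, ranks entries per query."""

    def __init__(self) -> None:
        self.entries: dict[str, Entry] = {}

    def observe(self, name: str, note: str) -> None:
        # "Refinement" here only means merging notes about the same entity;
        # a real system would presumably use an LMM to consolidate evidence.
        entry = self.entries.setdefault(name, Entry(name))
        entry.notes.add(note.lower())

    def _score(self, entry: Entry, words: set[str]) -> int:
        # Count query words that appear in the entry's name or notes.
        text = set(entry.name.lower().split())
        for note in entry.notes:
            text |= set(note.split())
        return len(words & text)

    def select(self, query: str, k: int = 2) -> list[str]:
        # Query-adaptive selection (assumed form): keep the top-k entries
        # with non-zero keyword overlap, breaking ties by name.
        words = set(query.lower().split())
        ranked = sorted(
            self.entries.values(),
            key=lambda e: (-self._score(e, words), e.name),
        )
        return [e.name for e in ranked[:k] if self._score(e, words) > 0]


bank = ContextBank()
bank.observe("my mug", "blue ceramic mug on desk")
bank.observe("my mug", "chipped handle")
bank.observe("alice", "coworker wearing red glasses")
selected = bank.select("where is my blue mug", k=1)
```

In this sketch, only the entries relevant to the query would be passed on to the model as context, rather than the user's entire visual history. The keyword matching is a deliberately simple placeholder for whatever retrieval or agentic selection the actual method uses.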