Personal Visual Context Learning in Large Multimodal Models

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

