PEARL: Personalized Streaming Video Understanding Model

Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang

From arXiv (arXiv submission)

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
