We explore how combining several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools, including video segment localization and object memory querying, along with other visual foundation models to interactively solve the task, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, narrowing the gap between open-source models and proprietary counterparts such as Gemini 1.5 Pro.
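To make the two-part memory design concrete, the sketch below shows one plausible shape for such a unified memory in Python. It is a minimal illustration only: the names (TemporalEvent, ObjectTrack, UnifiedMemory, localize_segments, query_objects) are hypothetical and do not reflect the paper's actual API, and the keyword-based segment localization is a stand-in for whatever retrieval the real system performs.

```python
from dataclasses import dataclass, field

@dataclass
class TemporalEvent:
    """A captioned video segment in the temporal memory."""
    start_sec: float
    end_sec: float
    caption: str  # generic event description, e.g. produced by a VLM

@dataclass
class ObjectTrack:
    """Object-centric tracking state for one tracked object."""
    object_id: int
    category: str
    # frame index -> bounding box (x1, y1, x2, y2)
    states: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class UnifiedMemory:
    """Structured memory combining temporal events and object tracks."""
    events: list[TemporalEvent] = field(default_factory=list)
    tracks: dict[int, ObjectTrack] = field(default_factory=dict)

    def localize_segments(self, keyword: str) -> list[TemporalEvent]:
        """Video segment localization tool: return events whose caption
        mentions the keyword. (A real system would likely use embedding
        similarity rather than substring matching.)"""
        return [e for e in self.events if keyword.lower() in e.caption.lower()]

    def query_objects(self, category: str) -> list[ObjectTrack]:
        """Object memory querying tool: return all tracks of a category."""
        return [t for t in self.tracks.values() if t.category == category]

if __name__ == "__main__":
    mem = UnifiedMemory()
    mem.events.append(TemporalEvent(0.0, 5.0, "a person picks up a cup"))
    mem.tracks[1] = ObjectTrack(object_id=1, category="cup",
                                states={0: (10.0, 20.0, 50.0, 60.0)})
    print(mem.localize_segments("cup"))  # matching temporal events
    print(mem.query_objects("cup"))      # matching object tracks
```

In a tool-use loop, an LLM would receive the task query and decide when to call methods like these, grounding its answer in the returned segments and tracks rather than processing every frame.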