The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.
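To make the idea of an entity scene graph concrete, here is a minimal sketch of one possible representation, assuming nodes for people, places, and objects and timestamped relation edges; the class, method names, and the example query are all hypothetical and are not taken from the EGAgent codebase.

```python
from collections import defaultdict

class EntitySceneGraph:
    """Hypothetical entity scene graph: entities (people, places, objects)
    connected by timestamped relations observed in an egocentric video stream."""

    def __init__(self):
        self.entities = {}                  # name -> entity type
        self.relations = defaultdict(list)  # name -> [(relation, target, time)]

    def add_entity(self, name, etype):
        self.entities[name] = etype

    def add_relation(self, src, relation, dst, t):
        # Record a directed, timestamped relation, e.g. ("alice", "holds", "mug", 120.0).
        self.relations[src].append((relation, dst, t))

    def neighbors(self, name, relation=None, t_range=None):
        """Structured one-hop search, optionally filtered by relation and time window."""
        out = []
        for rel, dst, t in self.relations[name]:
            if relation is not None and rel != relation:
                continue
            if t_range is not None and not (t_range[0] <= t <= t_range[1]):
                continue
            out.append((rel, dst, t))
        return out

    def multi_hop(self, start, relations):
        """Compositional multi-hop query: follow a chain of relations from a start entity."""
        frontier = {start}
        for rel in relations:
            frontier = {dst for n in frontier
                        for _, dst, _ in self.neighbors(n, relation=rel)}
        return frontier

# Example multi-hop question: "which place holds the object Alice was holding?"
g = EntitySceneGraph()
g.add_entity("alice", "person")
g.add_entity("mug", "object")
g.add_entity("kitchen", "place")
g.add_relation("alice", "holds", "mug", t=120.0)
g.add_relation("mug", "located_in", "kitchen", t=125.0)
print(g.multi_hop("alice", ["holds", "located_in"]))  # {'kitchen'}
```

A planning agent could expose `neighbors` and `multi_hop` as callable tools alongside visual and audio search, letting it compose temporally filtered graph lookups into longer reasoning chains.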