FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Multimodal LLMs (MLLMs) have recently become popular for long video understanding because of their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade and inference time grows. A crucial step when using MLLMs for long video understanding is therefore selecting the keyframes needed to answer a user query. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It combines a lightweight trainable Scene-Caption LLM Selector, which selects query-relevant clips based on their graph-based captions, with a training-free Patch-wise Sparse-Flow Retention (PSFR) method, which selects keyframes from the resulting sequence of clips; these keyframes are then fed into an MLLM to produce the final answer. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
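The two-stage structure described above (select clips from their captions, then select keyframes within the chosen clips) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the paper's implementation: `select_clips` uses plain word overlap as a stand-in for the trainable Scene-Caption LLM Selector, `select_keyframes` uses a mean pixel-difference threshold as a stand-in for PSFR, and the names `Clip`, `frame_change`, `focus_pipeline`, `top_k`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A grayscale frame as rows of pixel intensities in [0, 1] (illustrative only).
Frame = List[List[float]]


@dataclass
class Clip:
    caption: str          # textual description of the clip
    frames: List[Frame]   # frames belonging to the clip


def select_clips(clips: List[Clip], query: str, top_k: int) -> List[int]:
    """Stage 1 stand-in: rank clips by word overlap between caption and query.

    ASSUMPTION: the paper uses a trainable LLM selector over graph-based
    captions; word overlap is only a placeholder for that scoring step.
    """
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(clip.caption.lower().split())), index)
        for index, clip in enumerate(clips)
    ]
    # Highest score first; ties broken by earlier clip index.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_k]
    # Return the chosen clip indices in temporal order.
    return sorted(index for _, index in best)


def frame_change(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    count = sum(len(row) for row in a)
    return total / count


def select_keyframes(frames: List[Frame], threshold: float) -> List[int]:
    """Stage 2 stand-in: keep a frame when it differs enough from the last kept one.

    ASSUMPTION: this is a simple training-free heuristic, not PSFR itself.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame of the clip
    for index in range(1, len(frames)):
        if frame_change(frames[kept[-1]], frames[index]) > threshold:
            kept.append(index)
    return kept


def focus_pipeline(
    clips: List[Clip], query: str, top_k: int = 2, threshold: float = 0.1
) -> List[Tuple[int, int]]:
    """Return (clip index, frame index) pairs that would be passed to an MLLM."""
    selected = []
    for clip_index in select_clips(clips, query, top_k):
        for frame_index in select_keyframes(clips[clip_index].frames, threshold):
            selected.append((clip_index, frame_index))
    return selected
```

Calling `focus_pipeline(clips, "where is the red mug")` on a list of `Clip` objects returns only the frames from the best-matching clips that show a visible change, which mirrors the goal of passing far fewer frames to the MLLM than the full video contains.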
