Agent-based systems that leverage large language models (LLMs) for key information retrieval and integration have recently emerged as a promising approach to long video understanding. However, these systems face two major challenges. First, they typically model and reason over individual frames and thus struggle to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human coarse-to-fine recollection process, VideoLucy employs a hierarchical memory structure with progressive granularity, which explicitly defines the level of detail and the temporal scope of memory at each hierarchical depth. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding, designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and to capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io.
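For intuition, the following is a minimal, hypothetical sketch of the coarse-to-fine iterative backtracking loop described above. It is not the authors' implementation; all names (MemoryLevel, Memory, caption_segment, llm_answer, is_confident, select_relevant) are illustrative placeholders standing in for captioning and LLM-agent components.

```python
# Minimal, hypothetical sketch of a coarse-to-fine memory backtracking loop
# (illustrative only; not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryLevel:
    span_seconds: float   # temporal scope covered by one memory entry
    detail: str           # e.g. "coarse", "medium", "fine"


@dataclass
class Memory:
    start: float
    end: float
    text: str             # caption produced for this segment


def answer_with_backtracking(
    video_duration: float,
    question: str,
    levels: List[MemoryLevel],
    caption_segment: Callable[[float, float, str], str],     # placeholder captioner
    llm_answer: Callable[[str, List[Memory]], str],          # placeholder answerer
    is_confident: Callable[[str, str, List[Memory]], bool],  # placeholder verifier
    select_relevant: Callable[[str, List[Memory]], List[Tuple[float, float]]],
) -> str:
    """Iteratively deepen memories over question-relevant spans until the
    agent is confident enough to answer (coarse -> fine)."""
    spans = [(0.0, video_duration)]   # start from the whole video
    memories: List[Memory] = []
    answer = ""
    for level in levels:              # progressively finer granularity
        for start, end in spans:
            t = start
            while t < end:            # consecutive segments keep temporal context
                seg_end = min(t + level.span_seconds, end)
                memories.append(
                    Memory(t, seg_end, caption_segment(t, seg_end, level.detail))
                )
                t = seg_end
        answer = llm_answer(question, memories)
        if is_confident(question, answer, memories):
            return answer             # enough evidence gathered at this depth
        spans = select_relevant(question, memories)  # backtrack into relevant spans only
    return answer                     # best effort after the deepest level
```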