Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions that rely on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global-summary, scene, segment, and entity levels. We then employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it reaches 80.1% in the challenging reasoning category. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
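The hierarchy and search described above can be pictured as a tree whose levels run from a global summary down to individual entities, traversed coarse-to-fine at query time. The following is a minimal illustrative sketch, not the authors' implementation: the `Node` structure, the word-overlap `score`, and the beam width are all hypothetical stand-ins for the learned representations and retrieval policy used in HAVEN.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical index node; level is one of:
    # "summary" | "scene" | "segment" | "entity"
    level: str
    text: str                      # caption/description for this node
    children: list = field(default_factory=list)

def score(query: str, node: Node) -> int:
    # Toy relevance: word overlap between query and node text.
    # (A real system would use multimodal embeddings.)
    return len(set(query.lower().split()) & set(node.text.lower().split()))

def agentic_search(root: Node, query: str, beam: int = 2) -> list:
    # Coarse-to-fine traversal: at each level, keep the top-`beam`
    # children by relevance and descend until leaf (entity) nodes.
    frontier = [root]
    while frontier and frontier[0].children:
        children = [c for n in frontier for c in n.children]
        children.sort(key=lambda c: score(query, c), reverse=True)
        frontier = children[:beam]
    return [n.text for n in frontier if score(query, n) > 0]
```

A usage example on a tiny hand-built index: searching `"chef onions"` over a tree with a kitchen scene and an interview scene descends summary → scene → segment and returns the matching entity nodes.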