Long video understanding poses significant challenges for vision-language models due to extremely long context windows. Existing solutions, which rely on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across the visual and auditory streams, while organizing content into a structured hierarchy spanning the global-summary, scene, segment, and entity levels. We then employ an agentic search mechanism for dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it reaches 80.1% in the challenging reasoning category. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
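To make the hierarchy-plus-agentic-search idea concrete, the sketch below models the four-level index (global summary, scene, segment, entity) as a tree and runs a greedy top-down retrieval over it. This is a hypothetical illustration, not the paper's implementation: the `Node` structure, the word-overlap `score`, and the `agentic_search` loop are all stand-ins for HAVEN's learned representations and agentic retrieval policy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in the hierarchical video index.
    level is one of: "summary", "scene", "segment", "entity"."""
    level: str
    text: str
    children: list = field(default_factory=list)

def score(node: Node, query: str) -> int:
    # Toy relevance: word overlap between query and node text.
    # HAVEN would use multimodal embeddings here instead.
    q = set(query.lower().split())
    t = set(node.text.lower().split())
    return len(q & t)

def agentic_search(root: Node, query: str, top_k: int = 1) -> list[Node]:
    """Descend the hierarchy, expanding only the top_k most relevant
    children at each node -- a greedy stand-in for the paper's
    agentic retrieval loop over summary/scene/segment/entity layers."""
    frontier = [root]
    hits = []
    while frontier:
        node = frontier.pop()
        hits.append((score(node, query), node))
        ranked = sorted(node.children, key=lambda c: -score(c, query))
        frontier.extend(ranked[:top_k])
    hits.sort(key=lambda h: -h[0])
    return [n for s, n in hits if s > 0]

# Minimal example: a summary node, one scene, one tracked entity.
root = Node("summary", "a detective investigates a theft")
scene = Node("scene", "the detective questions a suspect at night")
scene.children.append(Node("entity", "suspect wears a red coat"))
root.children.append(scene)

results = agentic_search(root, "red coat suspect")
print([n.level for n in results])  # finest-grained match ranks first
```

The greedy descent keeps retrieval cost proportional to tree depth rather than video length, which is the motivation for layering the index rather than chunking the video flat.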