Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions that rely on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global-summary, scene, segment, and entity levels. We then employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it reaches 80.1% in the challenging reasoning category. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
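The hierarchy and search described above can be pictured as a tree whose levels run from a global summary down to individual entities, traversed coarse-to-fine at query time. The following is a minimal illustrative sketch, not the authors' implementation: the `Node` structure, the word-overlap `score`, and the beam width are all hypothetical stand-ins for the learned representations and retrieval policy used in HAVEN.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical index node; level is one of:
    # "summary" | "scene" | "segment" | "entity"
    level: str
    text: str                      # caption/description for this node
    children: list = field(default_factory=list)

def score(query: str, node: Node) -> int:
    # Toy relevance: word overlap between query and node text.
    # (A real system would use multimodal embeddings.)
    return len(set(query.lower().split()) & set(node.text.lower().split()))

def agentic_search(root: Node, query: str, beam: int = 2) -> list:
    # Coarse-to-fine traversal: at each level, keep the top-`beam`
    # children by relevance and descend until leaf (entity) nodes.
    frontier = [root]
    while frontier and frontier[0].children:
        children = [c for n in frontier for c in n.children]
        children.sort(key=lambda c: score(query, c), reverse=True)
        frontier = children[:beam]
    return [n.text for n in frontier if score(query, n) > 0]
```

A usage example on a tiny hand-built index: searching `"chef onions"` over a tree with a kitchen scene and an interview scene descends summary → scene → segment and returns the matching entity nodes.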