Long video understanding poses significant challenges for vision-language models due to extremely long context windows. Existing solutions, which rely on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across the visual and auditory streams, while organizing content into a structured hierarchy spanning the global-summary, scene, segment, and entity levels. We then employ an agentic search mechanism for dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it reaches 80.1% in the challenging reasoning category. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
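To make the hierarchy-plus-agentic-search idea concrete, the sketch below models the four-level index (global summary, scene, segment, entity) as a tree and runs a greedy top-down retrieval over it. This is a hypothetical illustration, not the paper's implementation: the `Node` structure, the word-overlap `score`, and the `agentic_search` loop are all stand-ins for HAVEN's learned representations and agentic retrieval policy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in the hierarchical video index.
    level is one of: "summary", "scene", "segment", "entity"."""
    level: str
    text: str
    children: list = field(default_factory=list)

def score(node: Node, query: str) -> int:
    # Toy relevance: word overlap between query and node text.
    # HAVEN would use multimodal embeddings here instead.
    q = set(query.lower().split())
    t = set(node.text.lower().split())
    return len(q & t)

def agentic_search(root: Node, query: str, top_k: int = 1) -> list[Node]:
    """Descend the hierarchy, expanding only the top_k most relevant
    children at each node -- a greedy stand-in for the paper's
    agentic retrieval loop over summary/scene/segment/entity layers."""
    frontier = [root]
    hits = []
    while frontier:
        node = frontier.pop()
        hits.append((score(node, query), node))
        ranked = sorted(node.children, key=lambda c: -score(c, query))
        frontier.extend(ranked[:top_k])
    hits.sort(key=lambda h: -h[0])
    return [n for s, n in hits if s > 0]

# Minimal example: a summary node, one scene, one tracked entity.
root = Node("summary", "a detective investigates a theft")
scene = Node("scene", "the detective questions a suspect at night")
scene.children.append(Node("entity", "suspect wears a red coat"))
root.children.append(scene)

results = agentic_search(root, "red coat suspect")
print([n.level for n in results])  # finest-grained match ranks first
```

The greedy descent keeps retrieval cost proportional to tree depth rather than video length, which is the motivation for layering the index rather than chunking the video flat.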