Understanding long videos typically relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to ``finding a needle in a haystack.'' To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection as a Visual Semantic-Logical Search problem. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update the frame sampling distribution through an iterative refinement process, enabling context-aware identification of the semantically critical frames required by a given query. Our method establishes new state-of-the-art performance in keyframe selection metrics on a manually annotated benchmark. Furthermore, when applied to downstream video question answering, the proposed approach achieves the largest performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be made publicly available.
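To make the iterative refinement concrete, the following is a minimal conceptual sketch of how per-relation relevance scores for the four logical dependencies could reweight a frame sampling distribution over successive iterations. All names (`refine_sampling_distribution`, the relation keys, the temperature parameter) are illustrative assumptions, not the paper's actual implementation; in the real method the relation scores would be re-estimated from newly sampled frames at each iteration rather than held fixed.

\begin{verbatim}
import numpy as np

def refine_sampling_distribution(relation_scores, num_frames,
                                 num_iters=3, temperature=0.5):
    """Iteratively reweight a frame sampling distribution.

    relation_scores: dict mapping a relation name (e.g.
        'spatial_cooccurrence', 'temporal_proximity',
        'attribute_dependency', 'causal_order') to a length-T array
        of per-frame relevance scores in [0, 1].
    Returns a probability distribution over the T frames.
    """
    probs = np.full(num_frames, 1.0 / num_frames)  # start uniform
    for _ in range(num_iters):
        # Combine evidence contributed by each logical dependency.
        combined = np.zeros(num_frames)
        for scores in relation_scores.values():
            combined += np.asarray(scores, dtype=float)
        # Sharpen the distribution toward frames supported by
        # more relations, keeping it a valid probability vector.
        logits = np.log(probs + 1e-8) + combined / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
    return probs

# Toy example: 8 frames, scores for two of the four relations.
scores = {
    "spatial_cooccurrence": [0, 0, 1, 1, 0, 0, 0, 0],
    "causal_order":         [0, 0, 0, 1, 1, 0, 0, 0],
}
print(refine_sampling_distribution(scores, num_frames=8))
\end{verbatim}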