Long video understanding remains challenging for multimodal large language models (MLLMs): their limited context windows make it necessary to identify the sparse video segments relevant to a query. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and the varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into multiple segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop that estimates the relevance scores of observed segments to the query and propagates them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering under sparse observation. Experiments show that our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
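To make the propagation idea concrete, the following is a minimal sketch of how relevance scores from a few observed segments might spread over a visual-temporal affinity graph. It is an illustration under stated assumptions, not the VideoDetective implementation: the abstract does not specify the affinity formula or the update rule, so the product of cosine similarity and an exponential temporal decay, the label-propagation-style iteration, and all names and parameters (`build_affinity`, `propagate`, `top_k`, `tau`, `alpha`) are hypothetical. The observed scores stand in for what an MLLM would return during the verification step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_affinity(features, tau=2.0):
    """Assumed affinity: non-negative visual similarity times temporal decay."""
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(cosine(features[i], features[j]), 0.0)
                temporal = math.exp(-abs(i - j) / tau)  # closer segments weigh more
                W[i][j] = visual * temporal
    return W


def propagate(W, observed, alpha=0.8, iters=50):
    """Label-propagation-style diffusion (a stand-in, not the paper's rule).

    `observed` maps segment index -> relevance score in [0, 1]; unseen
    segments start at 0 and receive scores from their graph neighbours.
    """
    n = len(W)
    # Row-normalise the affinity matrix so each row sums to 1 (or stays all zero).
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total for w in row] if total > 0 else [0.0] * n)
    y = [observed.get(i, 0.0) for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [
            alpha * sum(P[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
        # Keep observed segments fixed at their measured scores.
        for i, score in observed.items():
            f[i] = score
    return f


def top_k(scores, k):
    """Indices of the k highest-scoring segments."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy example: six segments, two visually distinct scenes (hypothetical features).
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
observed = {0: 1.0, 5: 0.0}  # segment 0 judged relevant, segment 5 irrelevant
scores = propagate(build_affinity(features), observed)
selected = top_k(scores, 3)
```

In this toy setup only two of the six segments are inspected, yet the unseen segments of the first scene inherit high scores from segment 0, with nearer segments scoring higher, while the second scene stays at zero. In a full loop, the highest-scoring unseen segments would be the natural candidates to inspect next.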