This paper tackles long video temporal grounding (VTG), an emerging and challenging problem that requires localizing the video moments related to a natural language (NL) query. Compared with short videos, long videos are also in high demand but far less explored, and they bring new challenges: higher inference cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework that sits on top of existing VTG models and handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism that incorporates contrastive learning in a novel way to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art (SOTA) results. Analyses also show higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while maintaining SOTA results. Code has been released at https://github.com/houzhijian/CONE.
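The sliding window and query-guided window selection steps described in the abstract can be illustrated with a minimal sketch. It is not the released CONE implementation: the function names, the mean pooling of frame features, and the cosine similarity score are all assumptions chosen for illustration, since the abstract does not specify how windows are scored. The sketch only shows the coarse stage, in which a long video is cut into overlapping windows and only the windows most similar to the query are passed on to a fine-grained VTG model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame spans that together cover the whole video."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add one last window aligned to the end so the tail is not dropped.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average per-frame features into one window-level feature (assumed pooling)."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def select_windows(
    frame_feats: Sequence[Vector],
    query_feat: Vector,
    window: int,
    stride: int,
    top_k: int,
) -> List[Tuple[int, int]]:
    """Coarse stage: keep the top_k windows most similar to the query."""
    spans = sliding_windows(len(frame_feats), window, stride)
    scored = [(cosine(mean_pool(frame_feats[s:e]), query_feat), (s, e)) for s, e in spans]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [span for _, span in scored[:top_k]]


# Toy example: frames 4..7 point in the same direction as the query.
frame_feats = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
query_feat = [0.0, 1.0]
selected = select_windows(frame_feats, query_feat, window=4, stride=2, top_k=1)
print(selected)  # [(4, 8)]
```

Because only `top_k` windows reach the more expensive fine-grained model, the cost of that stage no longer grows with the total number of windows, which is the intuition behind the inference speedups reported in the abstract.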