Video Moment Retrieval (VMR) aims to retrieve the most relevant events from an untrimmed video given a natural language query. Existing VMR methods suffer from two defects: (1) they require large amounts of expensive temporal annotation to reach satisfactory performance; (2) they deploy complicated cross-modal interaction modules, which lead to high computational cost and low retrieval efficiency. To address these issues, we propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR), which balances retrieval accuracy, efficiency, and annotation cost for VMR. Specifically, CFMR learns from point-level supervision, where each annotation is a single frame randomly located within the target moment. Such annotation is 6 times cheaper than conventional annotation of event boundaries. Furthermore, we design a concept-based multimodal alignment mechanism that bypasses cross-modal interaction modules during inference, remarkably improving retrieval efficiency. Experimental results on three widely used VMR benchmarks demonstrate that CFMR establishes a new state of the art among methods with point-level supervision. Moreover, it significantly accelerates retrieval, requiring more than 100 times fewer FLOPs than existing approaches with point-level supervision.
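The two ideas above can be illustrated with a minimal sketch. It is not the CFMR implementation: the function names `sample_point_annotation` and `rank_moments`, the uniform sampling of the annotated frame, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration. The sketch shows only (a) that a point-level label is one timestamp inside the moment rather than two boundaries, and (b) that when video moments are encoded offline, answering a query reduces to one similarity computation per moment, with no cross-modal interaction module in the loop.

```python
import math
import random


def sample_point_annotation(start, end, rng):
    """Return one timestamp drawn uniformly from the moment [start, end].

    Uniform sampling is an assumption; the abstract only says the frame
    is randomly located within the target moment.
    """
    if end < start:
        raise ValueError("end must not precede start")
    return rng.uniform(start, end)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_moments(query_vec, moment_vecs):
    """Rank precomputed moment embeddings by similarity to the query.

    The video side is assumed to be encoded offline, so each query costs
    one similarity computation per candidate moment.
    """
    scores = [cosine(query_vec, m) for m in moment_vecs]
    return sorted(range(len(moment_vecs)), key=lambda i: scores[i], reverse=True)


rng = random.Random(0)
# One timestamp (in seconds) instead of a (start, end) boundary pair.
point = sample_point_annotation(12.0, 20.5, rng)

# Toy embeddings, made up for illustration (not model outputs).
moments = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranking = rank_moments([0.0, 1.0], moments)  # indices of moments, best first
```

In this toy setup the per-query cost grows linearly with the number of candidate moments and is independent of any joint video-text network, which is the kind of saving a decoupled alignment scheme aims for.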