Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query. In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best-matching proposal. In addition to modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling. The core idea is to sample a subset of moments, guided by learnable templates, using a DETR (DEtection TRansformer) framework. To achieve this, we design a multi-scale visual-linguistic encoder and an anchor-guided moment decoder paired with a set of learnable templates. Experimental results on three public datasets demonstrate the superior performance of MS-DETR.
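To make the proposal-based setting concrete, the following is a minimal sketch of generating candidate moments and selecting the best one. It is not the MS-DETR model: the uniform grid, the function names (`enumerate_proposals`, `temporal_iou`, `select_best`), and the scoring function are all illustrative assumptions, and a learned cross-modal matching score would take the place of the stand-in used here.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def enumerate_proposals(duration: float, step: float) -> List[Moment]:
    """Enumerate every (start, end) pair on a uniform grid.

    Hypothetical dense scheme, used only for illustration.
    """
    n = int(duration // step)
    grid = [i * step for i in range(n + 1)]
    return [(s, e) for i, s in enumerate(grid) for e in grid[i + 1:]]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_best(proposals: List[Moment],
                score_fn: Callable[[Moment], float]) -> Moment:
    """Return the proposal with the highest score."""
    return max(proposals, key=score_fn)


if __name__ == "__main__":
    proposals = enumerate_proposals(duration=10.0, step=2.5)
    # Stand-in score (assumption): overlap with a fixed target moment.
    target = (2.5, 7.5)
    best = select_best(proposals, lambda m: temporal_iou(m, target))
    print(len(proposals), best)
```

With dense enumeration, the number of candidates grows quadratically with the number of grid points, so modeling relations between all pairs of moments becomes expensive. This cost is what motivates sampling a subset of moments.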