Video moment retrieval is a fundamental vision-language task that aims to retrieve target moments from an untrimmed video given a language query. Existing methods typically generate numerous proposals in advance, either manually or via generative networks, as the support set for retrieval, which is both inflexible and time-consuming. Inspired by the success of diffusion models in object detection, this work reformulates video moment retrieval as a denoising generation process, eliminating the separate proposal-generation step. To this end, we propose a novel proposal-free framework, DiffusionVMR, which directly samples random spans from noise as candidates and introduces denoising learning to ground target moments. During training, Gaussian noise is added to the ground-truth moments, and the model learns to reverse this process. During inference, a set of time spans is progressively refined from initial noise to the final output. Notably, training and inference in DiffusionVMR are decoupled: an arbitrary number of random spans can be used at inference, independent of the number used during training. Extensive experiments on three widely used benchmarks (QVHighlights, Charades-STA, and TACoS) demonstrate the effectiveness of DiffusionVMR in comparison with state-of-the-art methods.
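The noising and refinement steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; it makes several assumptions the abstract does not state: spans are encoded as normalized (center, width) pairs, the noise follows a standard DDPM linear beta schedule, and inference uses a deterministic DDIM-style update. The names `alpha_bar_schedule`, `add_noise`, `toy_denoiser`, and `sample_spans` are hypothetical, and `toy_denoiser` is only a placeholder for a trained network conditioned on the video and the query.

```python
import numpy as np

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process used in training: corrupt ground-truth spans x0 to step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def toy_denoiser(x_t, t, target):
    """Hypothetical stand-in for the trained model: always predicts `target`."""
    return np.broadcast_to(target, x_t.shape).copy()

def sample_spans(denoiser, num_spans, alpha_bar, num_infer_steps, rng):
    """Refine `num_spans` random spans from pure noise with DDIM-style updates."""
    steps = np.linspace(len(alpha_bar) - 1, 0, num_infer_steps).astype(int)
    x_t = rng.standard_normal((num_spans, 2))  # random (center, width) candidates
    for i, t in enumerate(steps):
        x0_pred = denoiser(x_t, t)  # model's estimate of the clean spans
        if i == len(steps) - 1:
            return x0_pred
        ab_t, ab_prev = alpha_bar[t], alpha_bar[steps[i + 1]]
        # Recover the implied noise, then re-noise the estimate to the next step.
        eps_pred = (x_t - np.sqrt(ab_t) * x0_pred) / np.sqrt(1.0 - ab_t)
        x_t = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred
    return x_t

rng = np.random.default_rng(0)
alpha_bar = alpha_bar_schedule()
gt = np.array([[0.40, 0.20]])  # one ground-truth moment as (center, width)

# Training side: a noisy version of the ground-truth span at step 500.
noisy, _ = add_noise(gt, t=500, alpha_bar=alpha_bar, rng=rng)

# Inference side: the number of spans is chosen freely here.
denoiser = lambda x_t, t: toy_denoiser(x_t, t, gt)
spans = sample_spans(denoiser, num_spans=7, alpha_bar=alpha_bar,
                     num_infer_steps=4, rng=rng)
```

Because `sample_spans` takes `num_spans` as an argument and the denoiser processes each span independently, the number of candidates at inference does not have to match anything fixed during training, which is the decoupling the abstract refers to.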