Video grounding aims to localize the target moment in an untrimmed video that corresponds to a given sentence query. Existing methods typically either select the best prediction from a set of predefined proposals or directly regress the target span in a single shot, and therefore lack a systematic process for refining predictions. In this paper, we propose DiffusionVG, a novel diffusion-based framework that formulates video grounding as a conditional generation task, in which the target span is generated from Gaussian noise inputs and iteratively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. At inference, DiffusionVG generates the target span from Gaussian noise inputs through the learned reverse diffusion process, conditioned on the video-sentence representations. Without bells and whistles, DiffusionVG outperforms existing well-crafted models on the mainstream Charades-STA, ActivityNet Captions, and TACoS benchmarks.
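To make the training and inference procedure described above concrete, the following is a minimal NumPy sketch of span diffusion, not the authors' implementation. Everything beyond the forward/reverse structure is an assumption made for illustration: the span is parameterized as a normalized (center, width) pair, the noise schedule is a cosine schedule, the reverse process uses a deterministic DDIM-style update, and `denoiser` is a hypothetical stand-in for the learned network that would be conditioned on the video-sentence representations.

```python
import numpy as np


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal rate alpha_bar[t] for t = 0..num_steps (assumed cosine schedule)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]  # alpha_bar[0] == 1, decreasing towards 0


def forward_diffuse(span, t, alpha_bar, rng):
    """Fixed forward process: sample x_t ~ N(sqrt(ab_t) * x_0, (1 - ab_t) * I)."""
    noise = rng.standard_normal(span.shape)
    x_t = np.sqrt(alpha_bar[t]) * span + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


def reverse_sample(denoiser, alpha_bar, timesteps, rng, dim=2):
    """Iteratively refine a span, starting from pure Gaussian noise.

    `timesteps` must be strictly decreasing and end at 0.
    `denoiser(x_t, t)` is a hypothetical callable returning an estimate of x_0.
    """
    x = rng.standard_normal(dim)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x0_hat = denoiser(x, t)  # current estimate of the clean span
        # Noise implied by the current sample and the clean-span estimate.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Deterministic (DDIM-style) step to the less noisy timestep t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x


# Usage with an oracle denoiser standing in for the trained model.
rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(num_steps=1000)
target = np.array([0.40, 0.20])  # assumed (center, width), normalized by video length
noisy_span, _ = forward_diffuse(target, t=500, alpha_bar=alpha_bar, rng=rng)
predicted = reverse_sample(lambda x, t: target, alpha_bar, [1000, 750, 500, 250, 0], rng)
```

In an actual system the oracle would be replaced by a network trained to recover the target span from `noisy_span`, the timestep, and the fused video-sentence features; the number of sampling steps then trades inference cost against the amount of refinement.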