Video grounding aims to localize the target moment in an untrimmed video that corresponds to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single shot, and therefore lack a systematic prediction refinement process. In this paper, we propose DiffusionVG, a novel diffusion-based framework that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and iteratively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. During inference, DiffusionVG generates the target span from Gaussian noise inputs through the learned reverse diffusion process, conditioned on the video-sentence representations. DiffusionVG follows an encoder-decoder architecture, which first encodes the video-sentence features and then iteratively denoises the predicted spans in its specialized span refining decoder. Without bells and whistles, DiffusionVG demonstrates competitive or even superior performance compared to existing well-crafted models on the mainstream Charades-STA and ActivityNet Captions benchmarks.
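To make the training and inference procedure described above concrete, the following is a minimal sketch and not the DiffusionVG implementation. Several details are assumptions made only for illustration: the cosine noise schedule, the representation of a span as a (start, end) pair scaled to [-1, 1], the DDIM-style deterministic update, and the names `cosine_alpha_bar`, `forward_diffuse`, and `reverse_refine`. The `denoiser` argument is a hypothetical stand-in for the span refining decoder, and `cond` for the encoded video-sentence representations.

```python
import numpy as np


def cosine_alpha_bar(num_steps):
    """Cumulative signal rates alpha_bar_t for t = 0..num_steps.

    A cosine schedule is assumed here; the abstract does not name one.
    """
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return np.clip(f / f[0], 1e-5, 1.0)


def forward_diffuse(span, t, alpha_bar, rng):
    """Fixed forward process: corrupt a clean span to noise level t."""
    noise = rng.standard_normal(span.shape)
    noisy = np.sqrt(alpha_bar[t]) * span + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise


def reverse_refine(denoiser, cond, alpha_bar, steps, rng, shape=(2,)):
    """Reverse process: start from Gaussian noise and refine iteratively.

    `denoiser(x, t, cond)` is a hypothetical callable that predicts the
    clean span from the noisy span x at step t, given the conditioning.
    Assumes 1 <= steps <= len(alpha_bar) - 1 so every visited t is > 0.
    """
    x = rng.standard_normal(shape)
    timesteps = np.linspace(len(alpha_bar) - 1, 0, steps + 1).round().astype(int)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        # Predict the clean span, then keep it inside the valid range.
        x0_hat = np.clip(denoiser(x, t, cond), -1.0, 1.0)
        # Noise implied by the current sample and the clean-span prediction.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Deterministic (DDIM-style) move to the less noisy step t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x
```

In training, a model would be fit so that `denoiser(noisy, t, cond)` recovers the clean span from the output of `forward_diffuse`. At inference, `reverse_refine` is called with the trained model, and the number of `steps` trades refinement quality against compute.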