Video moment retrieval seeks an efficient and generalizable way to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To this end, we propose a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process, from random browsing to gradual localization. Specifically, we first diffuse the ground-truth span into random noise, then learn to denoise that noise back to the original span, guided by the similarity between text and video. This lets the model learn a mapping from arbitrary random locations to real moments, so that it can locate segments starting from a random initialization. Once trained, MomentDiff can sample random temporal segments as initial guesses and iteratively refine them to produce accurate temporal boundaries. Unlike discriminative approaches (e.g., those based on learnable proposals or queries), MomentDiff, with its randomly initialized spans, can resist the temporal location biases present in datasets. To evaluate the influence of these biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
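The diffuse-then-denoise procedure described above can be illustrated with a minimal sketch. It is not the released MomentDiff implementation. The normalized (center, width) span parametrization, the cosine noise schedule, the deterministic DDIM-style update, and the names `alpha_bar`, `diffuse_span`, and `refine_span` are all assumptions made for illustration. The `denoiser` argument is a hypothetical stand-in for the trained network that predicts a clean span from a noisy one under text-video guidance.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal rate at step t (assumed cosine schedule)."""
    s = 0.008

    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def diffuse_span(span, t, num_steps, rng):
    """Forward process: blend a clean span with Gaussian noise at step t."""
    a = alpha_bar(t, num_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in span]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(span, noise)]
    return noisy, noise


def refine_span(denoiser, num_steps, rng):
    """Reverse process: start from pure noise and refine step by step.

    `denoiser(x, t)` is a hypothetical callable that returns a predicted
    clean span; a trained model conditioned on text and video would be
    plugged in here.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]  # random initial guess
    for t in range(num_steps, 0, -1):
        a_t = alpha_bar(t, num_steps)
        a_prev = alpha_bar(t - 1, num_steps)
        # Predicted clean span, clipped to the normalized video range.
        x0 = [min(max(v, 0.0), 1.0) for v in denoiser(x, t)]
        # Noise implied by the current sample and the prediction.
        eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
               for xi, x0i in zip(x, x0)]
        # Deterministic DDIM-style move to the previous noise level.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * e
             for x0i, e in zip(x0, eps)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    true_span = [0.4, 0.2]  # toy (center, width), normalized to [0, 1]
    noisy, _ = diffuse_span(true_span, t=25, num_steps=50, rng=rng)
    print("noisy span at t=25:", noisy)
    # Oracle denoiser used only to exercise the loop; a real model is learned.
    print("refined span:", refine_span(lambda x, t: true_span, 50, rng))
```

In this sketch the refinement loop never depends on a learned set of proposals or queries: every run begins from freshly sampled noise, which is the property the abstract credits for resistance to temporal location bias.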