Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval

Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Because no diverse and generalisable VMR dataset exists to support learning scalable moment-text associations, existing methods resort to joint training on both source and target domain videos for cross-domain applications. Meanwhile, recent vision-language multimodal models pre-trained on large-scale image-text and/or video-text pairs rely only on coarse (weakly labelled) associations, and cannot provide the fine-grained moment-text correlations required for cross-domain VMR. In this work, we solve the problem of unseen cross-domain VMR, where certain visual and textual concepts do not overlap across domains, by utilising only target domain sentences (text prompts) without accessing the corresponding videos. To that end, we explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences, which enables us to simulate target domain videos. We address two problems in video editing for optimising unseen domain VMR: (1) generating high-quality simulation videos of different moments with subtle distinctions, and (2) selecting simulation videos that complement existing source training videos without introducing harmful noise or unnecessary repetition. For the first problem, we formulate a two-stage video diffusion generation process controlled simultaneously by (1) the original structure of a source video, (2) subject specifics, and (3) a target sentence prompt. This ensures fine-grained variations between video moments. For the second problem, we introduce a hybrid selection mechanism that combines two quantitative metrics for noise filtering with one qualitative metric that leverages VMR predictions to select simulation videos.
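
To make the hybrid selection idea concrete, the following is a minimal sketch and not the method as implemented in this work. The abstract does not name the metrics, so the three scores (`text_alignment`, `structure_fidelity`, `vmr_iou`), the thresholds, and the rule of skipping videos that a VMR model already localises well are all illustrative assumptions. The two quantitative scores stand in for noise filtering, and the VMR-prediction score stands in for the check against unnecessary repetition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimulatedVideo:
    """A generated video with hypothetical, precomputed scores."""
    name: str
    text_alignment: float      # assumed: agreement with the target sentence, in [0, 1]
    structure_fidelity: float  # assumed: agreement with the source video structure, in [0, 1]
    vmr_iou: float             # assumed: IoU of a VMR model's predicted moment vs. the label


def select_simulations(
    candidates: List[SimulatedVideo],
    min_alignment: float = 0.5,  # hypothetical threshold
    min_fidelity: float = 0.5,   # hypothetical threshold
    max_iou: float = 0.7,        # hypothetical threshold
) -> List[SimulatedVideo]:
    """Keep candidates that pass both noise filters and are not redundant."""
    selected = []
    for video in candidates:
        # Noise filtering: both quantitative scores must reach their thresholds.
        if video.text_alignment < min_alignment:
            continue
        if video.structure_fidelity < min_fidelity:
            continue
        # Redundancy check (assumed reading): a video the VMR model already
        # localises well is treated as adding little new training signal.
        if video.vmr_iou > max_iou:
            continue
        selected.append(video)
    return selected


if __name__ == "__main__":
    pool = [
        SimulatedVideo("clean_and_novel", 0.9, 0.8, 0.3),
        SimulatedVideo("off_prompt", 0.2, 0.9, 0.3),
        SimulatedVideo("already_solved", 0.9, 0.9, 0.95),
    ]
    print([v.name for v in select_simulations(pool)])
```

In this sketch the filters are applied as hard thresholds for simplicity; a weighted combination of the scores would be an equally plausible design under the same assumptions.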
