Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval

Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Because no diverse and generalisable VMR dataset exists to support learning scalable moment-text associations, existing methods resort to joint training on both source and target domain videos for cross-domain applications. Meanwhile, recent vision-language multimodal models pre-trained on large-scale image-text and/or video-text pairs rely only on coarse (weakly labelled) associations, which are inadequate to provide the fine-grained moment-text correlations required for cross-domain VMR. In this work, we solve the problem of unseen cross-domain VMR, where certain visual and textual concepts do not overlap across domains, by utilising only target domain sentences (text prompts) without accessing their videos. To that end, we explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences, enabling us to simulate target domain videos. We address two problems in video editing for optimising unseen domain VMR: (1) generating high-quality simulation videos of different moments with subtle distinctions, and (2) selecting simulation videos that complement existing source training videos without introducing harmful noise or unnecessary repetition. For the first problem, we formulate a two-stage video diffusion generation controlled simultaneously by (1) the original video structure of a source video, (2) subject specifics, and (3) a target sentence prompt. This ensures fine-grained variations between video moments. For the second problem, we introduce a hybrid selection mechanism that combines two quantitative metrics for noise filtering with one qualitative metric that leverages VMR prediction to select simulation videos.
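
The hybrid selection step can be illustrated with a minimal Python sketch. The abstract does not name the three metrics, so everything below is an assumption made for illustration: `text_alignment` and `structure_fidelity` stand in for the two quantitative noise filters, and a temporal IoU band on a VMR model's prediction stands in for the prediction-based criterion. The class, function names, and thresholds are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) of a moment, in seconds


@dataclass
class Candidate:
    """One simulated video. All fields are hypothetical placeholders."""
    video_id: str
    text_alignment: float      # assumed quantitative metric 1 (higher is better)
    structure_fidelity: float  # assumed quantitative metric 2 (higher is better)
    predicted_span: Span       # moment predicted by a VMR model on this video
    target_span: Span          # moment label carried over from the source video


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_simulations(
    candidates: List[Candidate],
    min_alignment: float = 0.25,  # assumed threshold
    min_fidelity: float = 0.5,    # assumed threshold
    iou_low: float = 0.3,         # assumed: below this, treat as harmful noise
    iou_high: float = 0.9,        # assumed: above this, treat as redundant
) -> List[Candidate]:
    """Keep candidates that pass both noise filters and fall in the IoU band."""
    kept = []
    for c in candidates:
        # Stage 1: the two quantitative filters reject noisy generations.
        if c.text_alignment < min_alignment or c.structure_fidelity < min_fidelity:
            continue
        # Stage 2: use the VMR prediction to drop videos that are either
        # unlearnable (very low IoU) or already well handled (very high IoU).
        iou = temporal_iou(c.predicted_span, c.target_span)
        if iou_low <= iou <= iou_high:
            kept.append(c)
    return kept
```

The IoU band reflects one possible reading of "complement existing source training videos without introducing harmful noise or unnecessary repetitions": very poor predictions are treated as noise, and near-perfect predictions as repetition of what the model already knows. The actual criterion used in the work may differ.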
