Video corpus moment retrieval (VCMR) is a new video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between a video and the query is partial, which is evident in two aspects: (1) Scope: an untrimmed video contains many information-rich frames, not all of which are relevant to the query. Strong correlation is typically observed only within the relevant moment, which makes capturing key content important. (2) Modality: the relevance of the query varies across modalities; action descriptions align more with visual elements, while character conversations relate more to textual information. Recognizing and addressing these modality-specific differences is crucial for effective retrieval in VCMR. However, existing methods often treat all video content equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement a specialized partial relevance enhancement strategy for each. For video retrieval, we introduce a multi-modal collaborative video retriever, which uses modality-specific pooling to generate a distinct query representation for each modality, ensuring a more effective match. For moment localization, we propose the focus-then-fuse moment localizer, which uses modality-specific gates to capture essential content and then fuses the multi-modal information for moment localization. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art for VCMR.
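The two strategies described above can be illustrated with a minimal NumPy sketch. It is not the PREM implementation: the softmax-weighted pooling, the max-over-clips cosine score, the sigmoid gate, the additive fusion, and every name (`modality_pool`, `video_score`, `focus_then_fuse`, `w_vis`, `w_sub`) are assumptions chosen only to make the ideas concrete. Visual and subtitle features are assumed to be clip-aligned and to share one embedding size.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_pool(query_tokens, w):
    # query_tokens: (n_words, d); w: (d,) hypothetical learned vector for one modality.
    # Words that matter for this modality receive larger pooling weights.
    weights = softmax(query_tokens @ w)        # (n_words,)
    return weights @ query_tokens              # (d,)

def cosine(clips, q):
    # Cosine similarity between each clip feature (T, d) and a query vector (d,).
    clips = clips / np.linalg.norm(clips, axis=-1, keepdims=True)
    q = q / np.linalg.norm(q)
    return clips @ q                           # (T,)

def video_score(query_tokens, visual, subtitle, w_vis, w_sub):
    # Assumed scoring rule: best-matching clip per modality, then summed.
    q_vis = modality_pool(query_tokens, w_vis)
    q_sub = modality_pool(query_tokens, w_sub)
    return cosine(visual, q_vis).max() + cosine(subtitle, q_sub).max()

def focus_then_fuse(q_vis, q_sub, visual, subtitle):
    # "Focus": an assumed sigmoid gate per clip and modality down-weights
    # clips that do not match the modality-specific query.
    g_vis = 1.0 / (1.0 + np.exp(-(visual @ q_vis)))      # (T,)
    g_sub = 1.0 / (1.0 + np.exp(-(subtitle @ q_sub)))    # (T,)
    # "Fuse": assumed additive fusion of the gated features.
    return g_vis[:, None] * visual + g_sub[:, None] * subtitle   # (T, d)

# Toy usage with random features (5 query words, 10 clips, d = 8).
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(5, 8))
visual = rng.normal(size=(10, 8))
subtitle = rng.normal(size=(10, 8))
w_vis, w_sub = rng.normal(size=8), rng.normal(size=8)

score = video_score(query_tokens, visual, subtitle, w_vis, w_sub)
fused = focus_then_fuse(modality_pool(query_tokens, w_vis),
                        modality_pool(query_tokens, w_sub),
                        visual, subtitle)
```

In this sketch the same query words yield two different query vectors, so a query about a conversation can lean on subtitle-oriented words while a query about an action leans on visually oriented ones. A moment localizer would then predict start and end boundaries from the fused features, a step omitted here.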