Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement

Video corpus moment retrieval (VCMR) is a new video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between video and query is partial, which shows in two aspects: (1) Scope: an untrimmed video contains many information-rich frames, not all of which are relevant to the query. Strong relevance is typically observed only within the relevant moment, which makes capturing key content important. (2) Modality: the relevance of a query differs across modalities; action descriptions align more with visual elements, while character conversations relate more to textual information. Recognizing and addressing these modality-specific differences is crucial for effective retrieval in VCMR. However, existing methods often treat all video content equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between query and video is essential for VCMR. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks, video retrieval and moment localization, and we design a specialized partial relevance enhancement strategy for each to match its objective. For video retrieval, we introduce a multi-modal collaborative video retriever that uses modality-specific pooling to generate a distinct query representation for each modality, enabling more effective matching. For moment localization, we propose a focus-then-fuse moment localizer that uses modality-specific gates to capture essential content and then fuses multi-modal information to localize the moment. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving new state-of-the-art performance on VCMR.
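
The idea of modality-specific pooling for video retrieval can be illustrated with a minimal sketch. It is not the PREM implementation: the attention-style pooling, the weight vectors `w_visual` and `w_text`, the cosine scoring, and the rule of summing the best match per modality are all assumptions made for illustration. The sketch pools the same query words into one query vector per modality, then scores a video by its best-matching frame and subtitle, so that only the relevant part of the video drives the score.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modality_pool(word_vecs, w):
    """Pool query word vectors into one vector, weighting each word by a
    softmax over its dot product with a modality-specific vector w
    (w is a hypothetical learned parameter)."""
    weights = softmax([dot(v, w) for v in word_vecs])
    dim = len(word_vecs[0])
    return [sum(wt * v[i] for wt, v in zip(weights, word_vecs)) for i in range(dim)]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def video_score(word_vecs, w_visual, w_text, frames, subtitles):
    """Score a video with one query vector per modality. Taking the max
    over frames/subtitles is an assumed way to reflect that only part of
    the video is relevant to the query."""
    q_vis = modality_pool(word_vecs, w_visual)
    q_txt = modality_pool(word_vecs, w_text)
    best_vis = max(cosine(q_vis, f) for f in frames)
    best_txt = max(cosine(q_txt, s) for s in subtitles)
    return best_vis + best_txt
```

With a single shared query vector, words describing actions and words quoting dialogue would be averaged together; pooling separately lets each modality emphasize the words it can actually match.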
