With the explosion of multimedia content, video moment retrieval (VMR), which aims to detect the moment in a video that matches a given text query, has been studied intensively as a critical problem. However, the existing VMR framework evaluates retrieval performance under the assumption that the relevant video is given, which may not reveal whether models are overconfident when the given video is wrong. In this paper, we propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task, which aims to retrieve video moments from a massive video set that includes multiple distractors, in order to evaluate the faithfulness of VMR models. For this task, we propose an automated framework for constructing massive video pools that categorizes videos into negative (distractor) and positive (false-negative) sets using textual and visual semantic distance verification methods. We extend existing VMR datasets with these methods and construct three practical MVMR datasets. To solve the task, we further propose CroCs, a strong informative sample-weighted learning method that employs two contrastive learning mechanisms: (1) weakly-supervised potential negative learning and (2) cross-directional hard-negative learning. Experimental results on the MVMR datasets reveal that existing VMR models are easily distracted by misinformation (distractors), whereas our model performs significantly more robustly, demonstrating that CroCs is essential for distinguishing positive moments from distractors. Our code and datasets are publicly available: https://github.com/yny0506/Massive-Videos-Moment-Retrieval.
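To make the MVMR setting concrete, the following is a minimal sketch of how retrieval over a video pool with distractors might be scored. It is not the authors' evaluation code. The hit-at-k criterion with a temporal IoU threshold is an assumption borrowed from common VMR practice, and the names `temporal_iou`, `hit_at_k`, `candidates`, and `positives`, along with all scores and timestamps, are hypothetical.

```python
from typing import List, Tuple

# A moment is (video_id, start_sec, end_sec). Hypothetical representation.
Moment = Tuple[str, float, float]


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(
    scored: List[Tuple[Moment, float]],
    positives: List[Moment],
    k: int,
    iou_threshold: float,
) -> bool:
    """True if any of the k highest-scored moments, ranked across the whole
    pool, lies in a positive video and overlaps a positive moment enough."""
    top_k = sorted(scored, key=lambda item: item[1], reverse=True)[:k]
    for (vid, start, end), _score in top_k:
        for pos_vid, pos_start, pos_end in positives:
            same_video = vid == pos_vid
            overlap = temporal_iou((start, end), (pos_start, pos_end))
            if same_video and overlap >= iou_threshold:
                return True
    return False


# Toy pool: one positive video ("v_pos") and one distractor ("v_neg").
# The scores stand in for a model's query-moment matching confidence.
candidates: List[Tuple[Moment, float]] = [
    (("v_neg", 0.0, 10.0), 0.9),   # overconfident score on a distractor
    (("v_pos", 5.0, 15.0), 0.8),   # close to the ground-truth moment
    (("v_pos", 30.0, 40.0), 0.2),
]
positives: List[Moment] = [("v_pos", 4.0, 14.0)]

print(hit_at_k(candidates, positives, k=1, iou_threshold=0.5))  # False
print(hit_at_k(candidates, positives, k=2, iou_threshold=0.5))  # True
```

In this toy pool the distractor moment outranks the correct one, so the query is missed at k=1 even though the model localizes the moment well inside the right video. This is the kind of failure that single-video evaluation cannot expose.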