Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology introduces a fundamental flaw in evaluation: because each caption is marked as relevant only to its original video, other videos that also match the caption are counted as negatives, which yields false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25 percentage points in recall, a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25 percentage points higher in recall for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we offer guidance for future text-to-video retrieval benchmarks.
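To illustrate the evaluation issue described above, the following is a minimal sketch of how Recall@k changes once additional positives are taken into account. It assumes the common hit-based definition (a caption counts as a success if at least one relevant video appears in the top k); the function name `recall_at_k`, the toy captions, and the video ids are hypothetical and are not taken from the released annotations.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    positives: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of captions with at least one relevant video in the top k."""
    hits = sum(
        1
        for caption, ranked in rankings.items()
        if positives[caption] & set(ranked[:k])
    )
    return hits / len(rankings)


# Hypothetical model output: videos ranked per caption, best first.
rankings = {
    "a man is cooking": ["v2", "v1", "v3"],
    "a dog runs on grass": ["v3", "v1", "v2"],
}

# Original labels: only the source video of each caption is a positive.
original = {
    "a man is cooking": {"v1"},
    "a dog runs on grass": {"v3"},
}

# Corrected labels (assumed): v2 also shows a man cooking, so it is
# no longer treated as a negative for the first caption.
corrected = {
    "a man is cooking": {"v1", "v2"},
    "a dog runs on grass": {"v3"},
}

r1_original = recall_at_k(rankings, original, k=1)    # 0.5
r1_corrected = recall_at_k(rankings, corrected, k=1)  # 1.0
print(f"R@1 original:  {r1_original:.2f}")
print(f"R@1 corrected: {r1_corrected:.2f}")
```

In this toy case the model's rankings are identical under both label sets; only the labels change, which is why correcting false negatives can raise the reported score without any change to the model.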