In Composed Video Retrieval (CoVR), the model receives a video and a textual description that modifies the video content as inputs. The aim is to retrieve, from a database of videos, the relevant video reflecting the modified content. A first step towards this challenging task is to acquire large-scale training datasets and high-quality evaluation benchmarks. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval built from large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the high-quality temporal video understanding this task requires. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this combination achieves strong results on EgoCVR. Our code and benchmark are freely available at https://github.com/ExplainableML/EgoCVR.
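To make the task setup concrete, the retrieve-then-rerank pipeline can be sketched in embedding space: compose the query video and modification text into a single query vector, shortlist gallery videos by similarity, then optionally re-order the shortlist with a second scoring function. This is a minimal illustration only; the fusion by vector addition, the function names, and the generic re-ranking hook are assumptions for exposition, not the paper's actual models.

```python
import numpy as np

def cosine_sim(query, gallery):
    # Cosine similarity between one query vector and each gallery row.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

def compose_query(video_emb, text_emb):
    # Hypothetical fusion: sum the video and text embeddings.
    # Real CoVR systems typically learn this composition.
    q = video_emb + text_emb
    return q / np.linalg.norm(q)

def retrieve_then_rerank(video_emb, text_emb, gallery,
                         rerank_scores=None, top_k=5):
    # Stage 1: shortlist candidates with the composed query.
    q = compose_query(video_emb, text_emb)
    sims = cosine_sim(q, gallery)
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: optionally re-order the shortlist with a second,
    # e.g. temporally aware, scoring function (placeholder here).
    if rerank_scores is not None:
        candidates = candidates[np.argsort(-rerank_scores[candidates])]
    return candidates
```

The two-stage split keeps the expensive, fine-grained scorer confined to a small shortlist, which is what makes a generic re-ranking stage practical on top of any first-stage retriever.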