We propose the Multi-modal Untrimmed Video Retrieval task and a new benchmark, MUVR, to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos that contain relevant segments, given multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, which express fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, making it well suited to long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and to define retrieval matching criteria precisely, we construct a multi-level visual correspondence based on the core video content that users are interested in and want to retrieve (e.g., news events, travel locations, dance moves). It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop three versions of MUVR (Base, Filter, and QA). MUVR-Base and MUVR-Filter evaluate retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. We extensively evaluate 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as the limitations of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.
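To make the one-to-many setting concrete, the following is a minimal Python sketch of how a multi-modal query and its scoring might be represented. It is not taken from the MUVR code base: the `MultiModalQuery` class, all of its field names, and the use of average precision as the ranking metric are assumptions made for illustration only. The abstract does not state which retrieval metric is used or how the Reranking Score is computed, so neither is reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# The six correspondence levels named in the abstract.
LEVELS = ("copy", "event", "scene", "instance", "action", "others")


@dataclass
class MultiModalQuery:
    """Hypothetical container; field names are illustrative, not MUVR's schema."""
    query_video_id: str                       # the video the query is centered on
    text: str                                 # long text description
    level: str                                # one of LEVELS
    tags: List[str] = field(default_factory=list)   # video tag prompts
    mask_path: Optional[str] = None           # mask prompt, if any (assumed to be a file)
    relevant_ids: Set[str] = field(default_factory=set)  # one-to-many ground truth

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown correspondence level: {self.level}")


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant videos.

    Assumed metric: a common choice when a query has many correct matches.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_ids)


# Usage with made-up identifiers: two relevant videos, ranked 1st and 3rd.
query = MultiModalQuery(
    query_video_id="q0",
    text="a long description of the dance move of interest",
    level="action",
    tags=["dance"],
    relevant_ids={"v1", "v3"},
)
ap = average_precision(["v1", "v2", "v3", "v4"], query.relevant_ids)
```

Because each query can match many untrimmed videos, a rank-aware metric computed over the full set of relevant items is a natural fit here; a single-target metric such as top-1 accuracy would ignore most of the ground truth.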