Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings produced by a shared MLLM backbone, enabling efficient embedding-based candidate search. Our embedding model, trained efficiently with low-rank adaptation (LoRA) on 700K paired visual-text samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval and state-of-the-art results on zero-shot composed video retrieval. With additional training to rerank candidates identified by the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models.
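The sketch below illustrates the kind of contrastive alignment objective the abstract refers to; it is not the paper's implementation. It assumes the shared MLLM backbone has already produced one pooled embedding per video and per text, and applies a symmetric InfoNCE loss with a learnable temperature. All names (contrastive_alignment_loss, the toy tensors, the CLIP-style temperature init) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired visual-text embeddings."""
    # Normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise similarities scaled by temperature; matching pairs sit on the diagonal.
    logits = logit_scale.exp() * v @ t.T
    targets = torch.arange(v.size(0), device=v.device)

    # Average of video-to-text and text-to-video cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Toy usage: in practice the embeddings would come from the shared,
# LoRA-adapted MLLM backbone rather than random tensors.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
logit_scale = torch.tensor(2.659)  # log(1/0.07), a common CLIP-style initialization
loss = contrastive_alignment_loss(video_emb, text_emb, logit_scale)
print(loss.item())
```

At inference time, the same normalized embeddings can be indexed for efficient nearest-neighbor candidate search, with the reranking stage applied only to the retrieved shortlist.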