UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval share a common paradigm: a reference visual is composed with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation; no prior work proposes a unified framework, let alone a zero-shot one. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework, which addresses all three tasks jointly without any task-specific human-annotated data. UniCVR combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces of the MLLM and the frozen VLP gallery encoder. We propose a cluster-based hard negative sampling strategy to strengthen the contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments on five benchmarks covering all three tasks show that UniCVR achieves state-of-the-art performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.
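The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify the contrastive loss, the cluster-based sampling procedure, or the dual-level re-scoring rule, so everything below is an assumption: a generic InfoNCE loss with explicit negatives stands in for Stage I, and a simple weighted score fusion over the top-k candidates stands in for Stage II. The names `info_nce`, `retrieve_then_rerank`, `rerank_score`, `temperature`, and `weight` are hypothetical and are not taken from UniCVR.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one query (assumed form, not the paper's).

    The loss is the negative log-softmax of the positive's similarity
    among the positive and all negatives. Harder negatives (closer to
    the query) raise the loss and so give a stronger training signal.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denom - logits[0]


def retrieve_then_rerank(query, gallery, rerank_score, top_k=2, weight=0.5):
    """Coarse retrieval by cosine similarity, then rerank only the top-k.

    `rerank_score(i)` is a hypothetical stand-in for an MLLM relevance
    score of gallery item i. Only the top_k candidates are re-scored,
    which keeps the number of expensive calls small.
    """
    coarse = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: coarse[i], reverse=True)
    head, tail = order[:top_k], order[top_k:]
    # Assumed fusion rule: convex combination of coarse and rerank scores.
    fused = {i: (1 - weight) * coarse[i] + weight * rerank_score(i) for i in head}
    head = sorted(head, key=lambda i: fused[i], reverse=True)
    return head + tail


# Toy usage with 2-D embeddings.
query = [1.0, 0.0]
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranking = retrieve_then_rerank(query, gallery, lambda i: 1.0 if i == 1 else 0.0)
easy_loss = info_nce(query, [1.0, 0.0], [[0.0, 1.0]])
hard_loss = info_nce(query, [1.0, 0.0], [[0.9, 0.1]])
```

In this toy setting the reranker promotes the second gallery item over the first, while the third item, which falls outside the top-k budget, is never re-scored. The two loss values show why hard negatives matter: a negative close to the query yields a larger loss than an unrelated one.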
