Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of multimodal large language models (MLLMs). While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization. To address these issues, we present a comprehensive study of the key factors that drive effective embedding learning for UMR with MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline and systematically analyze the primary contributors to high-performing universal retrieval systems. Building on this pipeline, we examine key design choices in embedding generation and training, including progressive transition, hard negative mining, and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Guided by these discoveries, we introduce a unified framework termed U-MARVEL (Universal MultimodAl RetrieVal via Embedding Learning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings and also exhibits strong zero-shot performance on tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across diverse embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
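The contrastive learning principle referenced above is commonly instantiated as a temperature-scaled InfoNCE objective, in which mined hard negatives compete with the positive candidate for each query embedding. The following is a minimal sketch of that generic recipe, not necessarily the exact U-MARVEL loss; the function name and the temperature value are illustrative choices.

```python
import numpy as np

def info_nce_loss(query, positive, hard_negatives, temperature=0.05):
    """Generic InfoNCE contrastive loss over one query, its positive
    candidate, and a set of mined hard negatives (illustrative sketch)."""
    # L2-normalize embeddings so dot products become cosine similarities
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = l2norm(query)
    pos = l2norm(positive)
    negs = l2norm(hard_negatives)
    # Similarity of the query to the positive (index 0) and each negative
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    # Numerically stable softmax cross-entropy with the positive as target
    logits = logits - logits.max()
    return -logits[0] + np.log(np.exp(logits).sum())
```

During training, the loss is typically averaged over a batch, with the other in-batch positives also serving as negatives; hard negative mining adds candidates that are close to the query but incorrect, which sharpens the decision boundary.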