Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Although such data is effective, the existing multimodal training data is highly imbalanced across modalities, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Finally, we provide in-depth analyses of model scaling and training strategies, and perform ablation studies on both the model and the synthetic data.
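To make the retrieval setup concrete, the sketch below illustrates the core operation shared by dense retrievers, including an MLLM-based embedder such as GME: queries and candidates, whether pure text, images, or fused text-image inputs, are mapped into a single shared vector space, and candidates are ranked by cosine similarity. This is a minimal sketch of the ranking step only; the random embeddings are hypothetical stand-ins for the vectors a unified encoder would actually produce.

```python
import numpy as np

def cosine_rank(query_vec: np.ndarray, cand_vecs: np.ndarray) -> np.ndarray:
    """Rank candidates by cosine similarity to the query.

    query_vec: (d,) embedding of the query (text, image, or fused input).
    cand_vecs: (n, d) embeddings of the candidates, in the same space.
    Returns candidate indices sorted from most to least similar.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    scores = c @ q              # cosine similarity per candidate
    return np.argsort(-scores)  # indices in descending similarity order

# Toy usage with random stand-in embeddings; in practice these vectors
# would come from applying one unified MLLM-based encoder to queries and
# candidates of any modality, which is what enables cross-modal search.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
candidates = rng.normal(size=(5, 768))
print(cosine_rank(query, candidates))
```

Because every modality is embedded by the same model into one space, a single nearest-neighbor index can serve text-to-image, image-to-text, and fused-modal queries alike.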