Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Despite its empirical success, this approach rests on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm is computationally inefficient at scale, as it requires a separate forward pass for each pair, and it overlooks contextual relationships among multiple queries that can refer to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract multiple query and target embeddings simultaneously, conditioned on a shared context representation, which amplifies the effective batch size and improves overall training efficiency. Trained on a newly curated 5M-sample multimodal multi-turn dataset (M3T), MuCo achieves state-of-the-art retrieval performance on the MMEB and M-BEIR benchmarks while markedly improving both training efficiency and cross-modal representation coherence. Code and M3T are available at https://github.com/naver-ai/muco
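The multi-turn contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the MLLM has already produced L2-normalized query and target embeddings for K turns per image, and shows how flattening the turn axis yields B*K in-batch candidates per query, which is the effective-batch-size amplification the abstract refers to. The function name, shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def multi_turn_info_nce(Q, P, tau=0.07):
    """Hypothetical InfoNCE loss over multi-turn embeddings.

    Q, P: arrays of shape (B, K, D) -- B images, K turns per image,
    D-dimensional L2-normalized query / target embeddings extracted
    from a single forward pass per image. Flattening the turn axis
    gives every query B*K candidates (1 positive, B*K - 1 negatives),
    so negatives come both from other images and from other turns.
    """
    B, K, D = Q.shape
    q = Q.reshape(B * K, D)
    p = P.reshape(B * K, D)
    logits = q @ p.T / tau                       # (B*K, B*K) cosine sims
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: query i matches target i
    return -np.mean(np.diag(log_probs))
```

Under this formulation, a batch of B images with K turns each behaves like a single-turn batch of size B*K, but requires only B forward passes through the encoder.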