Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally been trained with contrastive learning, which aligns the representations of query-target pairs across modalities. Despite their empirical success, these models rely primarily on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm scales inefficiently, because it requires a separate forward pass for each pair, and it overlooks contextual relationships among multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract multiple query and target embeddings simultaneously, all conditioned on a shared context representation, which amplifies the effective batch size and improves overall training efficiency. Experiments show that MuCo, trained with a newly curated 5M multimodal multi-turn dataset (M3T), achieves state-of-the-art retrieval performance on the MMEB and M-BEIR benchmarks while markedly improving both training efficiency and representation coherence across modalities. Code and M3T are available at https://github.com/naver-ai/muco
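To make the batch-size argument concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) loss computed over multi-turn embeddings. It is not the released MuCo implementation. The function names, the temperature value, the random stand-in embeddings, and the choice to treat every other pair in the batch (including other turns of the same image) as a negative are all assumptions made for illustration. The point it shows is only that B images with K turns each yield B*K query-target pairs from B forward passes.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def multi_turn_info_nce(query_emb, target_emb, temperature=0.05):
    """Hypothetical multi-turn contrastive loss (illustrative assumption).

    query_emb, target_emb: arrays of shape (B, K, D), i.e. K query/target
    embeddings per image, assumed to come from one forward pass per image.
    Returns (loss, effective_batch_size).
    """
    b, k, d = query_emb.shape
    # Flatten the turn dimension so every turn becomes its own contrastive pair.
    q = l2_normalize(query_emb.reshape(b * k, d))
    t = l2_normalize(target_emb.reshape(b * k, d))

    # Cosine-similarity logits between every query and every target.
    logits = (q @ t.T) / temperature  # shape (B*K, B*K)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching target for query i sits on the diagonal.
    loss = -np.mean(np.diag(log_probs))
    return loss, b * k


# Random stand-ins for MLLM outputs: 4 images, 3 turns each, 16-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 3, 16))
targets = queries + 0.01 * rng.normal(size=(4, 3, 16))  # nearly aligned pairs

loss, effective_batch = multi_turn_info_nce(queries, targets)
print(f"pairs per step: {effective_batch}, loss: {loss:.4f}")
```

In a single-turn setup, the same four forward passes would contribute only four pairs to the similarity matrix; here they contribute twelve. How same-image turns should be handled as negatives is a design decision this sketch does not settle.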