Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, which limits their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task that encourages the model to reconstruct the input from the corresponding <EOS> embedding. This drives the multimodal model to compress the semantic information of the input into the <EOS> token, laying the foundation for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. The results validate that content reconstruction is an effective strategy for maximizing the value of existing data: it enables multimodal embedding models to generate compact and informative representations and raises their performance ceiling.
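To make the idea of routing reconstruction through the <EOS> token more concrete, the following is a minimal sketch of one possible attention mask, not the actual CoCoA design. It assumes a hypothetical sequence layout of [input tokens] [<EOS>] [reconstruction tokens], keeps ordinary causal attention for the input and <EOS> positions, and lets reconstruction positions see only <EOS> and earlier reconstruction positions. The function names `eos_bottleneck_mask` and `render` and the layout itself are illustrative assumptions; the abstract does not specify how the attention flow is restructured.

```python
def eos_bottleneck_mask(n_input: int, n_recon: int) -> list:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed (hypothetical) layout: [input tokens] [<EOS>] [reconstruction tokens].
    This is an illustration of an <EOS> bottleneck, not the paper's exact mask.
    """
    eos = n_input                      # index of the <EOS> position
    total = n_input + 1 + n_recon      # full sequence length
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):         # no position ever looks ahead
            if q <= eos:
                # Input tokens and <EOS>: plain causal attention,
                # so <EOS> sees the whole input.
                mask[q][k] = True
            else:
                # Reconstruction tokens: only <EOS> and earlier
                # reconstruction tokens, never the raw input.
                mask[q][k] = k >= eos
    return mask


def render(mask: list) -> str:
    """Draw the mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(eos_bottleneck_mask(n_input=3, n_recon=2)))
```

Under this assumed mask, the reconstruction positions cannot read the input directly, so any information they need to rebuild it has to be carried by the <EOS> representation. That is the compression pressure the abstract describes, expressed in the simplest form.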