Multimodal Large Language Models (MLLMs) advance multimodal representation learning by producing transferable semantic embeddings, which substantially improve performance on vision-language tasks such as cross-modal retrieval, clustering, and classification. An effective embedding should preserve the semantic content of the input comprehensively while emphasizing the features that are discriminative for downstream tasks. Recent approaches show that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, which optimizes these two complementary objectives simultaneously. We argue that the two objectives can be decoupled: a model that first acquires a comprehensive understanding of the input can then achieve superior downstream performance through contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase that serves as a warm-up stage for contrastive learning. Experiments demonstrate that a small amount of pre-training data suffices to transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on MMEB, improving both efficiency and effectiveness. Our project is available at https://github.com/Trustworthy-Information-Access/CoMa.
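The abstract does not specify the contrastive objective, so the following is only an illustrative sketch of one common choice for embedding models, an in-batch InfoNCE loss. It is not CoMa's implementation: the function name `info_nce_loss`, the temperature value, and the NumPy arrays standing in for MLLM embeddings are all assumptions made for this example.

```python
import numpy as np


def info_nce_loss(query_emb, target_emb, temperature=0.05):
    """In-batch InfoNCE: row i of query_emb is the positive for row i of target_emb.

    All other rows in the batch act as negatives. The temperature is a
    hypothetical default, not a value taken from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch), scaled by temperature.
    logits = (q @ t.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives lie on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(4, 8))            # stand-in for query embeddings
    loss_matched = info_nce_loss(queries, queries)
    loss_mismatched = info_nce_loss(queries, queries[::-1].copy())
    print(loss_matched, loss_mismatched)
```

In this toy setup the loss is lower when each query is paired with its own target than when the targets are reversed, which is the behavior a contrastive objective is meant to reward.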