Multimodal large language models (MLLMs) advance multimodal representation learning by learning transferable semantic embeddings, substantially improving performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, which optimizes these two complementary objectives simultaneously. We argue that the two objectives can be decoupled: a comprehensive understanding of the input helps the embedding model achieve superior downstream performance through contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase that serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, CoMa can transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on the MMEB benchmark, improving both efficiency and effectiveness.
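For context, large-scale contrastive adaptation of this kind is typically trained with an InfoNCE-style objective over paired query and target embeddings; the following is a minimal sketch under that assumption, with the notation ($q_i$, $t_i$, $\mathrm{sim}$, $\tau$, $N$) introduced here for illustration rather than taken from the paper:
$$
\mathcal{L}_{\text{con}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\left(\mathrm{sim}(q_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(q_i, t_j)/\tau\right)},
$$
where $q_i$ and $t_i$ denote the embeddings of the $i$-th query and its paired target, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, and the remaining targets in the batch serve as in-batch negatives.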