There is growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, this encoding paradigm becomes less effective as instructions grow more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables a more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner on high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
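The two-stage flow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `reason` and `embed` functions below are hypothetical stubs standing in for the reasoner MLLM and the embedder model.

```python
# Sketch of the Think-Then-Embed (TTE) two-stage pipeline.
# `reason` and `embed` are hypothetical stand-ins for real models.

def reason(query: str) -> str:
    """Stage 1 (reasoner MLLM): generate a reasoning trace explaining
    the query. Stubbed here; a real system would prompt an MLLM."""
    return f"The query asks for: {query}. Key aspects to represent: ..."

def embed(text: str) -> list[float]:
    """Stage 2 (embedder): map text to a fixed-size vector.
    Stubbed with a trivial character-based embedding for illustration."""
    dim = 8
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]  # unit-normalized, as is typical

def think_then_embed(query: str) -> list[float]:
    trace = reason(query)               # explicit intermediate reasoning
    # The embedder conditions on both the original query and the trace.
    return embed(query + "\n" + trace)

vec = think_then_embed("find images of a red car parked near a tree")
print(len(vec))  # embedding dimensionality of the stub embedder
```

The key design point is that the embedder never sees the query alone: it always receives the query concatenated with the reasoner's trace, so the representation reflects the unpacked intent of a complex instruction.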