Universal multimodal embedding models are foundational to a wide range of downstream tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in their negative samples. Moreover, the resulting embeddings exhibit limited ability to distinguish false negatives from hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce an MLLM-as-a-Judge mechanism, which uses MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to relax the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
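The abstract's core training signal, aligning the model's query-candidate similarity matrix with the judge's soft semantic matching score matrix, can be sketched as a distribution-matching loss. The following is a minimal illustrative sketch, not the paper's actual implementation: it assumes both matrices are converted to per-query distributions via a temperature-scaled softmax and aligned with a KL divergence; the function name, temperature value, and loss choice are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (shift by the row max before exponentiating).
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_alignment_loss(sim, judge_scores, tau=0.05):
    """Hypothetical soft-label alignment loss.

    sim:          (Q, C) query-candidate similarities from the embedding model.
    judge_scores: (Q, C) MLLM-as-a-Judge semantic matching scores (soft labels).

    Each row is turned into a distribution over candidates; the loss is the
    mean KL divergence from the judge's target distribution to the model's,
    so the model is pushed toward the judge's graded view of the candidates
    rather than a rigid one-hot (one-to-one) target.
    """
    p = softmax(sim / tau, axis=1)           # model's candidate distribution
    q = softmax(judge_scores / tau, axis=1)  # judge's soft target distribution
    eps = 1e-12                              # guard against log(0)
    return float(np.mean(np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=1)))
```

Under this formulation the loss is zero when the model's similarities already rank and weight candidates exactly as the judge does, and it grows as the two distributions diverge, which is what lets graded judge scores supersede hard one-hot labels.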