Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building on this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query space, mapping retrieved candidates back to their owner queries to rigorously filter out semantic false negatives. Furthermore, our method constructs mutually hard clusters, maximizing intra-task discrimination and batch efficiency without redundant forward passes. Extensive experiments demonstrate that our unified approach achieves highly competitive fine-tuning performance on the Massive Multimodal Embedding Benchmark while using only a fraction of the standard training data.
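The query-space filtering idea behind SaHa can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple setup in which each candidate records the index of the query that owns it (i.e., the query it is a positive for), embeddings are plain NumPy arrays, and similarity is cosine similarity. A retrieved candidate is kept as a hard negative for a query only if its owner query differs, so candidates that actually answer the query (semantic false negatives) are dropped.

```python
import numpy as np

def mine_hard_negatives(query_embs, cand_embs, cand_owner, k=5):
    """Sketch of query-space hard negative filtering.

    query_embs : (Q, d) array of query embeddings
    cand_embs  : (C, d) array of candidate embeddings
    cand_owner : length-C list; cand_owner[j] is the index of the
                 query that candidate j is a positive for (hypothetical
                 bookkeeping, assumed available from the training pairs)
    k          : number of hard negatives to keep per query
    """
    # L2-normalize so the dot product is cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sim = q @ c.T  # (Q, C) similarity matrix

    negatives = []
    for i in range(len(q)):
        ranked = np.argsort(-sim[i])  # hardest (most similar) first
        # Query-space filter: map each candidate back to its owner query
        # and discard it if the owner is the current query itself.
        keep = [int(j) for j in ranked if cand_owner[j] != i]
        negatives.append(keep[:k])
    return negatives
```

In a candidate-space miner, the top-ranked candidate for a query is often its own positive, which would be mistakenly treated as a negative; the owner-query lookup above is what removes it before it contaminates the contrastive loss.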