Multimodal embeddings serve as a bridge for aligning vision and language, yet the two primary implementations -- CLIP-based and MLLM-based embedding models -- are both limited to capturing global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that the complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, which calls for a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information and then aggregates them adaptively and smoothly. Furthermore, we combine AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negative enhancement without requiring fine-grained editing of the dataset. Evaluations on the MMEB and MMVP-VLM benchmarks show that AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding compared with other multimodal embedding models.
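To make the fusion idea concrete, below is a minimal, hypothetical sketch of how multiple prompt-specific embeddings (e.g., one global view and several fine-grained views) could be aggregated with smooth, learned weights. The abstract does not specify the actual fusion architecture; the gating head, view count, and all names here (AdaptiveFusion, gate) are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Hypothetical sketch: fuse K prompt-specific MLLM embeddings
    with softmax gates so the mixture stays smooth and differentiable."""

    def __init__(self, dim: int, num_views: int):
        super().__init__()
        # Small gating head that scores each view from the concatenated views.
        self.gate = nn.Linear(dim * num_views, num_views)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim), one embedding per prompt/view.
        b, k, d = views.shape
        weights = F.softmax(self.gate(views.reshape(b, k * d)), dim=-1)  # (b, k)
        fused = (weights.unsqueeze(-1) * views).sum(dim=1)               # (b, d)
        return F.normalize(fused, dim=-1)  # unit-norm embedding for retrieval

# Usage: fuse one global view and two fine-grained views of dimension 1024.
fusion = AdaptiveFusion(dim=1024, num_views=3)
views = torch.randn(8, 3, 1024)
embedding = fusion(views)  # (8, 1024)
```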