Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.