Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
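The abstract does not give MEG's formula, so the following Python sketch only illustrates the anchoring idea under stated assumptions. It assumes that anchors are the top-k highest-IDF tokens of the ground-truth answer, and that evidence is scored by the mean log-probability gain a generator assigns to those anchors when the evidence is present. The function names (`idf_table`, `select_anchors`, `anchored_gain`), the smoothed IDF variant, and the per-token log-probabilities are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def idf_table(corpus_tokens):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(corpus_tokens)
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))  # count each token once per document
    idf = {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in df.items()}
    # Tokens never seen in the corpus get the maximum IDF (df = 0).
    default_idf = math.log(1 + n_docs) + 1.0
    return idf, default_idf


def select_anchors(answer_tokens, idf, default_idf, top_k=2):
    """Return positions of the top_k highest-IDF answer tokens (assumed anchor rule)."""
    scored = [(idf.get(tok, default_idf), i) for i, tok in enumerate(answer_tokens)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # high IDF first, ties by position
    return sorted(i for _, i in scored[:top_k])


def anchored_gain(logp_with, logp_without, anchors):
    """Mean log-prob gain on anchor positions when evidence is supplied.

    logp_with / logp_without are per-token log-probabilities of the
    ground-truth answer with and without the retrieved evidence in context.
    """
    if not anchors:
        return 0.0
    return sum(logp_with[i] - logp_without[i] for i in anchors) / len(anchors)


if __name__ == "__main__":
    corpus = [
        ["the", "tower", "is", "in", "paris"],
        ["the", "bridge", "is", "in", "london"],
        ["the", "tower", "is", "tall"],
    ]
    answer = ["the", "eiffel", "tower", "is", "in", "paris"]
    idf, default_idf = idf_table(corpus)
    anchors = select_anchors(answer, idf, default_idf, top_k=2)
    # Made-up log-probabilities, for illustration only.
    with_ev = [-0.1, -0.5, -0.2, -0.1, -0.1, -0.3]
    without_ev = [-0.1, -3.0, -0.4, -0.1, -0.1, -2.0]
    print(anchors, anchored_gain(with_ev, without_ev, anchors))
```

In this toy setup the function words ("the", "is") have low IDF and are ignored, so the score reflects only how much the evidence helps the model produce the rare, information-bearing tokens. A score of this kind could serve as a training signal for a reranker, but how MEG-RAG actually constructs that signal is not specified in the abstract.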