We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning-intensive settings because they do not verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, which evaluates factuality and information coverage, and CiteF1, which measures citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and of three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
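As a point of orientation, the F1 naming suggests a harmonic-mean combination of a precision-like and a recall-like score over claims; the sketch below assumes the standard F1 construction and is not the paper's formal definition. Here $P$ and $R$ are illustrative symbols: for InfoF1, $P$ would be read as factuality (the fraction of generated claims verified against the multimodal sources) and $R$ as information coverage; for CiteF1, the analogous components $P_c$ and $R_c$ would be citation support and citation completeness.

\[
\mathrm{InfoF1} \;=\; \frac{2\,P\,R}{P + R},
\qquad
\mathrm{CiteF1} \;=\; \frac{2\,P_c\,R_c}{P_c + R_c}
\]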