Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform that integrates diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions on difficult and ambiguous queries, revealing prevalent hallucinations. MM-RAGChecker effectively diagnoses these failures, guiding future improvements to Visual RAG systems.