MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation

The rapid evolution of Retrieval-Augmented Generation (RAG) toward multimodal, high-stakes enterprise applications has outpaced the development of domain-specific evaluation benchmarks. Existing datasets often rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents, where information is inextricably multimodal and reasoning requires synthesizing disjoint evidence. We address this gap by introducing MiRAGE, a Multiagent framework for RAG systems Evaluation that uses a collaborative swarm of specialized agents to generate verified, domain-specific, multimodal, and multi-hop question-answer datasets. MiRAGE orchestrates three components: a recursive context optimization loop that aggregates scattered evidence, an adversarial verifier agent that guarantees factual grounding, and an agent that recognizes the expert persona and the relevant domain in order to mimic expert cognitive workflows. Extensive empirical evaluation across four distinct domains (regulations, finance, quantitative biology, and journalism) demonstrates that MiRAGE generates datasets with significantly higher reasoning complexity (>2.3 average hops) and factual faithfulness. Our ablation studies indicate that MiRAGE can be powered by LLMs when textual descriptions of the images are available, although visual grounding remains a frontier. By automating the creation of gold-standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, MiRAGE provides the infrastructure needed to rigorously benchmark the next generation of information retrieval systems.
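The three roles named in the abstract (context aggregation, persona and domain recognition, adversarial verification) can be pictured as a small pipeline. The sketch below is purely illustrative and is not the authors' implementation: every name in it (`Chunk`, `QAPair`, `build_context`, `generate_verified_qa`, and the callables passed in) is hypothetical, the agents are plain Python callables standing in for LLM calls, and the context loop is written iteratively with an assumed depth limit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    """A retrievable unit of the corpus (text, or a textual image description)."""
    chunk_id: str
    text: str


@dataclass
class QAPair:
    """A candidate question-answer pair and the chunks it relies on."""
    question: str
    answer: str
    evidence_ids: List[str]


def build_context(
    seed: Chunk,
    retrieve: Callable[[List[Chunk]], List[Chunk]],
    is_complete: Callable[[List[Chunk]], bool],
    max_depth: int = 3,  # assumed limit, not taken from the paper
) -> List[Chunk]:
    """Widen the context around a seed chunk until it is judged complete."""
    context = [seed]
    seen = {seed.chunk_id}
    for _ in range(max_depth):
        if is_complete(context):
            break
        # Keep only chunks that are not already part of the context.
        new_chunks = [c for c in retrieve(context) if c.chunk_id not in seen]
        if not new_chunks:
            break  # nothing left to add
        context.extend(new_chunks)
        seen.update(c.chunk_id for c in new_chunks)
    return context


def generate_verified_qa(
    seed: Chunk,
    retrieve: Callable[[List[Chunk]], List[Chunk]],
    is_complete: Callable[[List[Chunk]], bool],
    infer_persona: Callable[[List[Chunk]], str],
    generate: Callable[[List[Chunk], str], QAPair],
    verify: Callable[[QAPair, List[Chunk]], bool],
) -> Optional[QAPair]:
    """Run the hypothetical pipeline; return None if the verifier rejects."""
    context = build_context(seed, retrieve, is_complete)
    persona = infer_persona(context)       # persona/domain agent
    candidate = generate(context, persona)  # question-answer generator
    # The verifier acts as a gate: ungrounded candidates are dropped.
    return candidate if verify(candidate, context) else None
```

Passing the agents in as callables keeps the sketch independent of any particular model or API. In a real system each callable would wrap a model call and the verifier would likely do more than return a boolean.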
