With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval accuracy, and planning accuracy. Using a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework that leverages verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.