Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or a limited set of modalities, and therefore fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both the modality and the temporal segment of the supporting evidence. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with both human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite producing correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
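For concreteness, the sketch below illustrates what a modality- and time-anchored citation of the kind described above might look like. The class and field names (`Citation`, `GroundedAnswer`, `modality`, `start_sec`, `end_sec`) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    """One piece of grounding evidence: the modality it comes from and the
    temporal segment of the input it covers (hypothetical schema)."""
    modality: str      # e.g. "video" or "audio"
    start_sec: float   # segment start, in seconds
    end_sec: float     # segment end, in seconds


@dataclass
class GroundedAnswer:
    """A model response with explicit reasoning and fact-level citations."""
    answer: str
    reasoning: str
    citations: List[Citation] = field(default_factory=list)


# Example: a claim grounded jointly in a video segment and an audio segment.
example = GroundedAnswer(
    answer="The speaker agrees to the proposal.",
    reasoning="The nod at 12-14s and the spoken 'yes, let's do it' at 13-15s "
              "jointly indicate agreement.",
    citations=[
        Citation(modality="video", start_sec=12.0, end_sec=14.0),
        Citation(modality="audio", start_sec=13.0, end_sec=15.0),
    ],
)
```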