PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations. Such issues are often subtle and domain-specific, and they ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 384 inconsistencies from 353 papers. Based on this set, we design three tasks (inconsistency identification, remedy, and pair matching) that assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. To address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (27.8-53.9%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
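
To make the idea of structured JSON-based answer representations more concrete, the following is a minimal sketch of how multiple-choice options might be encoded as uniform records instead of free-form sentences. The abstract does not specify the benchmark's schema, so every field name here (`element_a`, `element_b`, `attribute`, `value_a`, `value_b`), both helper functions, and all example values are hypothetical and serve only as illustration.

```python
import json
import random

# Hypothetical schema: these field names are assumptions for illustration,
# not the format actually used by PRISMM-Bench.
def to_structured_option(element_a, element_b, attribute, value_a, value_b):
    """Encode one candidate inconsistency as a flat record with fixed keys."""
    return {
        "element_a": element_a,
        "element_b": element_b,
        "attribute": attribute,
        "value_a": value_a,
        "value_b": value_b,
    }

def build_question(options, correct_index, seed=0):
    """Shuffle the options, label them A, B, C, ..., and serialize them as JSON.

    Returns the JSON string and the letter that marks the correct option.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    letters = "ABCDEFGH"
    labeled = {letters[i]: options[j] for i, j in enumerate(order)}
    answer = letters[order.index(correct_index)]
    return json.dumps(labeled, indent=2, sort_keys=True), answer

# Made-up example values; they do not come from the benchmark.
options = [
    to_structured_option("Table 2", "Section 4.1", "reported accuracy", "81.3", "83.1"),
    to_structured_option("Figure 3", "Section 3.2", "number of layers", "12", "12"),
    to_structured_option("Equation 5", "Algorithm 1", "learning rate symbol", "eta", "eta"),
]
question_json, answer = build_question(options, correct_index=0)
print(question_json)
print("correct option:", answer)
```

Because every option shares the same keys and carries no free-form phrasing, a model has fewer stylistic cues (sentence length, hedging words, tone) to exploit when it picks an answer without reading the paper context.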
