PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations; such issues are often subtle and domain-specific, and they ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. To address the well-known problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
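As a rough illustration of the structured JSON-based answer representations mentioned above, the following minimal sketch renders every multiple-choice option as a JSON object with the same fixed keys, so that options differ in content rather than in phrasing or length. The abstract does not specify the benchmark's schema: the field names (`element_a`, `element_b`, `conflict`), the helper names, and the example data are all hypothetical and chosen only for illustration.

```python
import json
import random

# Hypothetical field names; the actual schema used by the benchmark
# is not given in the abstract.
FIELDS = ("element_a", "element_b", "conflict")


def make_option(element_a, element_b, conflict):
    """Build one answer option as a dict with a fixed set of keys."""
    return dict(zip(FIELDS, (element_a, element_b, conflict)))


def render_question(stem, options, correct_index, seed=0):
    """Shuffle options, label them A, B, C, ..., and serialize each as JSON.

    Every option is serialized with the same sorted keys, so surface
    style carries less information about which option is correct.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    labels = [chr(ord("A") + i) for i in range(len(order))]
    rendered = {
        label: json.dumps(options[idx], sort_keys=True)
        for label, idx in zip(labels, order)
    }
    # The correct label is wherever the original correct option landed.
    answer = labels[order.index(correct_index)]
    return {"question": stem, "options": rendered, "answer": answer}


if __name__ == "__main__":
    # Invented example data, not taken from the benchmark.
    opts = [
        make_option("Table 2", "Section 4.1", "reported values differ"),
        make_option("Figure 3", "Figure 3 caption", "axis label mismatch"),
        make_option("Equation 5", "Algorithm 1", "symbol defined differently"),
    ]
    q = render_question("Which inconsistency is present?", opts, correct_index=0)
    print(json.dumps(q, indent=2))
```

Shuffling with a fixed seed keeps the rendering reproducible while avoiding a constant position for the correct option; whether the benchmark does either is an assumption of this sketch.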
