Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains: agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families, each operationalizing a distinct cognitive facet of scientific paper reasoning: multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across these tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.