Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.