Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these forms of inference in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control over both extraction and inference difficulty.
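To make the two-axis construction concrete, the following is a minimal sketch of the deduction case only, under assumptions that are not taken from SciR itself: the names `Task`, `forward_chain`, and `make_task`, the single-premise rule format, and the sentence templates are all hypothetical. It illustrates the general idea that the answer is computed from the formal object, so it stays verifiable however the text is rendered, while chain depth (inference difficulty) and distractor count (extraction difficulty) are set independently.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    documents: list[str]  # rendered sentences shown to the model
    question: str
    answer: bool          # ground truth computed from the formal object


def forward_chain(facts: set[str], rules: list[tuple[str, str]]) -> set[str]:
    """Close a fact set under single-premise rules (premise -> conclusion)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


def make_task(depth: int, n_distractors: int, seed: int = 0) -> Task:
    """Hypothetical generator: depth and n_distractors vary independently."""
    rng = random.Random(seed)
    # Inference axis: a chain P0 -> P1 -> ... -> P{depth}.
    rules = [(f"P{i}", f"P{i + 1}") for i in range(depth)]
    facts = {"P0"}
    # Rendering: relevant statements in a toy template.
    sentences = ["Property P0 holds for the sample."]
    sentences += [
        f"Whenever property {p} holds, property {c} holds." for p, c in rules
    ]
    # Extraction axis: irrelevant statements mixed into the rendered text.
    sentences += [
        f"Property D{j} was observed in the control sample."
        for j in range(n_distractors)
    ]
    rng.shuffle(sentences)
    target = f"P{depth}"
    # The answer comes from the formal object, not from the rendered text.
    answer = target in forward_chain(facts, rules)
    return Task(
        documents=sentences,
        question=f"Does property {target} hold for the sample?",
        answer=answer,
    )
```

Calling `make_task(depth=3, n_distractors=5)` and `make_task(depth=3, n_distractors=0)` yields the same ground-truth answer with different amounts of surrounding text, which is the kind of independent variation the abstract describes; a real rendering step would replace the toy templates with genre-specific scientific prose.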