Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, an unsupported or uncertain conclusion can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework on two complementary scientific benchmarks, SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder models, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when the available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical, model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science.
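The confidence-based abstention described above can be sketched as a simple selective-prediction loop: each verdict carries a confidence score, the verifier abstains below a threshold, and quality is measured by coverage (fraction of claims answered) and selective risk (error rate on the answered subset). The function and toy data below are an illustrative assumption, not the authors' implementation.

```python
def selective_metrics(predictions, labels, confidences, threshold):
    """Return (coverage, selective_risk) when abstaining below threshold.

    Minimal sketch of confidence-based selective abstention (hypothetical,
    not the paper's actual code): items with confidence under the threshold
    are abstained on; risk is computed only over the answered items.
    """
    answered = [(p, y) for p, y, c in zip(predictions, labels, confidences)
                if c >= threshold]
    coverage = len(answered) / len(predictions)
    if not answered:
        return coverage, 0.0
    errors = sum(1 for p, y in answered if p != y)
    return coverage, errors / len(answered)


# Toy example: four claim verdicts with gold labels and confidences.
preds = ["SUPPORT", "REFUTE", "SUPPORT", "REFUTE"]
golds = ["SUPPORT", "SUPPORT", "SUPPORT", "REFUTE"]
confs = [0.95, 0.75, 0.80, 0.90]

# Answering everything incurs the full error rate; raising the threshold
# trades a little coverage for lower risk, mirroring the abstract's finding.
cov_all, risk_all = selective_metrics(preds, golds, confs, 0.7)   # 1.0, 0.25
cov_sel, risk_sel = selective_metrics(preds, golds, confs, 0.8)   # 0.75, 0.0
```

At full coverage the low-confidence wrong verdict counts against the verifier; abstaining on it removes the error entirely at the cost of one unanswered claim, which is the risk-coverage trade-off the abstract refers to.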