MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks rely predominantly on static, single-turn question answering (QA) formats, which cannot adequately measure model performance on complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. The framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while most other models remain below 30%. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight a critical gap in current LLMs' strategic scientific reasoning and set a clear direction for future research toward AI that can actively participate in the scientific process.
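To make the multi-turn formulation concrete, the following is a minimal sketch of what such an interaction loop could look like. It is not MolQuest's actual interface: the names `ElucidationEpisode`, `run_episode`, and `scripted_agent`, the turn budget, the exact-string scoring rule, and the placeholder spectra are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ElucidationEpisode:
    """Hypothetical single task: hidden target plus spectra revealed on request."""
    target: str                 # ground-truth structure, e.g. a SMILES string
    spectra: Dict[str, str]     # experiment name -> recorded data
    max_turns: int = 5          # assumed interaction budget
    turns_used: int = 0
    solved: bool = False

    def request(self, experiment: str) -> str:
        """Spend one turn to reveal the data for one experiment."""
        self.turns_used += 1
        return self.spectra.get(experiment, "not available")

    def submit(self, hypothesis: str) -> bool:
        """Spend one turn to propose a structure (exact string match, a simplification)."""
        self.turns_used += 1
        self.solved = hypothesis == self.target
        return self.solved


# An agent maps the observations gathered so far to an (action, argument) pair.
Agent = Callable[[Dict[str, str]], Tuple[str, str]]


def run_episode(episode: ElucidationEpisode, agent: Agent) -> bool:
    """Alternate agent decisions and environment responses until solved or out of turns."""
    observations: Dict[str, str] = {}
    while episode.turns_used < episode.max_turns and not episode.solved:
        action, argument = agent(observations)
        if action == "request":
            observations[argument] = episode.request(argument)
        elif action == "submit":
            episode.submit(argument)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return episode.solved


def scripted_agent(observations: Dict[str, str]) -> Tuple[str, str]:
    """Stand-in for an LLM: gather two spectra, then submit a fixed guess."""
    for experiment in ("MS", "1H NMR"):
        if experiment not in observations:
            return ("request", experiment)
    return ("submit", "CCO")


# Placeholder data, not real measurements.
episode = ElucidationEpisode(
    target="CCO",
    spectra={"MS": "placeholder MS record", "1H NMR": "placeholder NMR record"},
)
result = run_episode(episode, scripted_agent)
print(result, episode.turns_used)
```

In a real setting the scripted agent would be replaced by a model call, and comparing raw strings would be too strict, since one molecule can be written as many different SMILES strings; some form of structure canonicalization would be needed before scoring.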

