Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research settings remains limited. Current scientific benchmarks rely predominantly on static, single-turn question answering (QA) formats, which are inadequate for measuring model performance on complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task that requires models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. The framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while most other models remain below 30%. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight a critical gap in current LLMs' strategic scientific reasoning and set a clear direction for future research toward AI that can actively participate in the scientific process.