Scientific discovery plays a pivotal role in advancing human society, and recent progress in large language models (LLMs) suggests their potential to accelerate this process. However, it remains unclear whether LLMs can autonomously generate novel and valid hypotheses in chemistry. In this work, we investigate whether LLMs can discover high-quality chemistry hypotheses given only a research background (comprising a question and/or a survey), without restricting the domain of the question. We begin with the observation that hypothesis discovery is a seemingly intractable task. To address this, we propose a formal mathematical decomposition grounded in a fundamental assumption: most chemistry hypotheses can be composed from a research background and a set of inspirations. This decomposition leads to three practical subtasks (retrieving inspirations, composing hypotheses with inspirations, and ranking hypotheses) that together constitute a sufficient set of subtasks for the overall scientific discovery task. We further develop an agentic LLM framework, MOOSE-Chem, that directly implements this mathematical decomposition. To evaluate the framework, we construct a benchmark of 51 high-impact chemistry papers published and made available online after January 2024, each manually annotated by PhD chemists with its background, inspirations, and hypothesis. The framework rediscovers many hypotheses with high similarity to the ground truth, successfully capturing the core innovations, while ensuring no data contamination, since it uses an LLM whose knowledge cutoff predates 2024. Finally, based on the LLM's surprisingly high accuracy on inspiration retrieval, an inherently out-of-distribution task, we propose a bold assumption: LLMs may already encode latent scientific knowledge associations not yet recognized by humans.
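To make the decomposition concrete, the following is a minimal illustrative sketch rather than the exact formulation used in the paper. Assuming a hypothesis $h$ can be composed from a background $b$ together with a small set of inspirations $i_1, \dots, i_k$ drawn from a literature corpus $\mathcal{I}$, the discovery objective factorizes into the three subtasks named above:

\[
p(h \mid b) \;=\; \sum_{\{i_1, \dots, i_k\} \subseteq \mathcal{I}} \underbrace{p(i_1, \dots, i_k \mid b)}_{\text{inspiration retrieval}} \;\underbrace{p(h \mid b, i_1, \dots, i_k)}_{\text{hypothesis composition}},
\qquad
\hat{h} \;=\; \arg\max_{h \in \mathcal{H}} \, s(h \mid b),
\]

where $\mathcal{H}$ denotes the pool of generated candidate hypotheses and $s$ is a ranking (scoring) function; the symbols $\mathcal{I}$, $\mathcal{H}$, and $s$ are notational assumptions introduced here only for illustration.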