Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined owing to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components (research questions, background surveys, inspirations, and hypotheses) from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation shows that LLMs perform well at retrieving inspirations, an out-of-distribution task, suggesting an ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
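To make the task decomposition concrete, below is a minimal Python sketch of how one benchmark instance and the three sub-tasks might be organized. All names, fields, and signatures here are illustrative assumptions for exposition, not the benchmark's actual schema or interface.

```python
from dataclasses import dataclass

# Hypothetical schema for one benchmark instance, assuming each paper is
# decomposed into the four components named in the abstract. Field names
# are assumptions, not the benchmark's actual format.
@dataclass
class BenchmarkInstance:
    discipline: str          # one of the 12 covered disciplines
    research_question: str   # the problem the source paper addresses
    background_survey: str   # prior-work summary available to the model
    inspirations: list[str]  # papers/ideas the hypothesis draws on
    hypothesis: str          # the ground-truth research hypothesis

# The three sub-tasks, expressed as stubs over that schema.
def retrieve_inspirations(instance: BenchmarkInstance,
                          corpus: list[str]) -> list[str]:
    """Sub-task 1: given the research question and background survey,
    rank candidate papers in `corpus` by how likely each is a true
    inspiration for the target hypothesis."""
    raise NotImplementedError

def compose_hypothesis(instance: BenchmarkInstance,
                       retrieved: list[str]) -> str:
    """Sub-task 2: given the question, background, and retrieved
    inspirations, generate a candidate research hypothesis."""
    raise NotImplementedError

def rank_hypotheses(candidates: list[str]) -> list[str]:
    """Sub-task 3: order candidate hypotheses by estimated quality."""
    raise NotImplementedError
```

Under this framing, each sub-task can be scored against the components extracted from the source paper (e.g., whether the ground-truth inspirations appear among the retrieved candidates), which is what allows the evaluation to run at scale without human annotation of every instance.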