Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks: inspiration retrieval, hypothesis composition, and hypothesis ranking. Here, sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components (research questions, background surveys, inspirations, and hypotheses) from papers across 12 disciplines, with expert validation confirming its accuracy. To guard against data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data. The same framework can extract even more recent papers as LLM pretraining cutoffs advance, supporting scalable, contamination-free renewal of the benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval, an out-of-distribution task, suggesting their ability to surface novel knowledge associations.