Large Language Models are Zero Shot Hypothesis Proposers

Significant scientific discoveries have driven the progress of human civilisation. The explosion of scientific literature and data has created information barriers across disciplines that have slowed the pace of scientific discovery. Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down these information barriers and foster a new wave of scientific discovery. However, the potential of LLMs for scientific discovery has not been formally explored. In this paper, we begin by investigating whether LLMs can propose scientific hypotheses. To this end, we construct a dataset consisting of background knowledge and hypothesis pairs from biomedical literature. The dataset is divided into training, seen, and unseen test sets based on publication date to control visibility. We then evaluate the hypothesis generation capabilities of various top-tier instructed models, including both closed and open-source LLMs, in zero-shot, few-shot, and fine-tuning settings. Additionally, we introduce an LLM-based multi-agent cooperative framework with different role designs and external tools to enhance hypothesis generation. We also design four metrics through a comprehensive review to evaluate the generated hypotheses in both ChatGPT-based and human evaluations. Through experiments and analyses, we arrive at the following findings: 1) LLMs surprisingly generate hypotheses that are absent from their training data yet validated by the test literature. 2) Increasing uncertainty facilitates candidate generation, potentially enhancing zero-shot hypothesis generation capabilities. These findings strongly support the potential of LLMs as catalysts for new scientific discoveries and guide further exploration.

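The date-based split mentioned in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the cutoff date, the record fields (`published`, `background`, `hypothesis`), the function name `split_by_date`, and the rule of holding out every tenth pre-cutoff record as the "seen" test set are all hypothetical, since the abstract does not give these details.

```python
from datetime import date

# Hypothetical cutoff: the abstract does not state the actual date.
VISIBILITY_CUTOFF = date(2023, 1, 1)


def split_by_date(records, cutoff=VISIBILITY_CUTOFF, seen_test_every=10):
    """Split background/hypothesis records by publication date.

    Records published on or after `cutoff` are treated as unseen by the
    models and go to the unseen test set. Earlier records go to the
    training set, except every `seen_test_every`-th one, which is held
    out as the seen test set (an assumed holdout rule).
    """
    splits = {"train": [], "seen_test": [], "unseen_test": []}
    visible_count = 0
    for rec in sorted(records, key=lambda r: r["published"]):
        if rec["published"] >= cutoff:
            splits["unseen_test"].append(rec)
            continue
        visible_count += 1
        if visible_count % seen_test_every == 0:
            splits["seen_test"].append(rec)
        else:
            splits["train"].append(rec)
    return splits
```

The point of such a split is that pairs published after the cutoff cannot have appeared in a model's training data, so hypotheses generated for them are not explained by memorisation.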