The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as promising tools to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from the biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions into our evaluation framework. Furthermore, we propose four novel metrics grounded in an extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that integrating additional knowledge through few-shot learning and tool use does not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.