Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. As with human experts, LLMs were more likely to be correct when they were confident in their predictions, which presages a future where humans and LLMs team up to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.