Large language models are a form of artificial intelligence system whose primary knowledge consists of the statistical patterns, semantic relationships, and syntactic structures of language¹. Despite this limited form of "knowledge", these systems are adept at numerous complex tasks, including creative writing, storytelling, translation, question answering, summarization, and computer code generation. However, they have yet to demonstrate advanced applications in natural science. Here we show how large language models can perform scientific synthesis, inference, and explanation. We present a method for using general-purpose large language models to make inferences from scientific datasets of the form usually associated with special-purpose machine learning algorithms. We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature. When a conventional machine learning system is augmented with this synthesized and inferred knowledge, it can outperform the current state of the art across a range of benchmark tasks for predicting molecular properties. This approach has the further advantage that the large language model can explain the machine learning system's predictions. We anticipate that our framework will open new avenues for AI to accelerate the pace of scientific discovery.