The ability to automatically generate accurate protocols for scientific experiments would represent a major step towards the automation of science. Large Language Models (LLMs) have impressive capabilities on a wide range of tasks, such as question answering and the generation of coherent text and code. However, LLMs can struggle with multi-step problems and long-term planning, both of which are crucial for designing scientific experiments. Moreover, evaluating the accuracy of scientific protocols is challenging, because experiments can be described correctly in many different ways, require expert knowledge to evaluate, and usually cannot be executed automatically. Here we present an automatic evaluation framework for the task of planning experimental protocols, and we introduce BioProt: a dataset of biology protocols with corresponding pseudocode representations. To measure performance on generating scientific protocols, we use an LLM to convert a natural language protocol into pseudocode, and then evaluate an LLM's ability to reconstruct the pseudocode from a high-level description and a list of admissible pseudocode functions. We evaluate GPT-3 and GPT-4 on this task and explore their robustness. We externally validate the utility of pseudocode representations of text by generating accurate novel protocols using retrieved pseudocode, and we run a generated protocol successfully in our biological laboratory. Our framework is extensible to the evaluation and improvement of language model planning abilities in other areas of science, or in other domains that lack automatic evaluation.
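As an illustration of how a reconstructed pseudocode protocol could be scored automatically against its reference, the following is a minimal sketch, not the framework's actual implementation. It assumes each protocol is reduced to an ordered list of function names and compares the two lists with multiset precision and recall plus a normalized Levenshtein distance. The choice of metrics, the helper names, and the example function names (`add_reagent`, `incubate`, and so on) are all hypothetical and do not come from the text above.

```python
from collections import Counter


def function_precision_recall(predicted, reference):
    """Multiset precision and recall over function names (order is ignored)."""
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared name.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def edit_distance(a, b):
    """Levenshtein distance between two sequences of function names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer sequence length (0.0 means identical)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical function names, used only for illustration.
reference = ["add_reagent", "incubate", "centrifuge", "wash", "measure_absorbance"]
predicted = ["add_reagent", "centrifuge", "incubate", "measure_absorbance"]

precision, recall = function_precision_recall(predicted, reference)
distance = normalized_edit_distance(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} distance={distance:.2f}")
```

In this sketch the precision and recall reward selecting the right functions regardless of order, while the edit distance penalizes steps that are missing or out of sequence. A fuller evaluation would presumably also compare function arguments.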