Although gold nanorods have been studied extensively, the pathways for controlling their shape, and thereby their optical properties, remain understood largely through heuristics. It is apparent that the simultaneous presence of, and interactions between, the various reagents during synthesis control these properties, yet computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach: leveraging the wealth of synthesis information already embedded in the scientific literature by developing tools that extract the relevant structured data in an automated, high-throughput manner. To that end, we present an approach that uses the GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 is fine-tuned so that its prompt completions predict synthesis templates, in the form of JSON documents, from unstructured text input, with an overall accuracy of $86\%$. This performance is notable considering that the model performs entity recognition and relation extraction simultaneously. We present a dataset of 11,644 entities extracted from 1,137 papers; 268 of these papers contain at least one complete seed-mediated gold nanorod growth procedure and outcome, for a total of 332 complete procedures.
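To make the idea of a JSON synthesis template concrete, the following is a minimal sketch of how a model completion might be parsed and checked for completeness. The section names (`seed`, `growth`, `outcome`), the field names, the reagent entries, and the helper functions are all hypothetical illustrations; they are not the schema or the filtering rule used in this work.

```python
import json

# Assumed top-level sections of a template; illustrative only, not the paper's schema.
REQUIRED_SECTIONS = ("seed", "growth", "outcome")


def parse_template(completion: str):
    """Return the parsed template, or None if the completion is not a JSON object."""
    try:
        doc = json.loads(completion)
    except json.JSONDecodeError:
        return None
    return doc if isinstance(doc, dict) else None


def is_complete(template) -> bool:
    """Treat a template as complete if every assumed section is present and non-empty."""
    return template is not None and all(template.get(k) for k in REQUIRED_SECTIONS)


# A made-up completion with placeholder values, standing in for model output.
example_completion = json.dumps({
    "seed": [{"reagent": "HAuCl4", "amount": "placeholder"}],
    "growth": [{"reagent": "CTAB", "amount": "placeholder"}],
    "outcome": {"shape": "rod", "aspect_ratio": "placeholder"},
})

complete = is_complete(parse_template(example_completion))
# A truncated completion fails to parse and is therefore not complete.
truncated = is_complete(parse_template('{"seed": ['))
```

Under these assumptions, a paper would count toward the set of complete procedures only if at least one of its extracted templates passes a check of this kind.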