Automated knowledge extraction from scientific literature can potentially accelerate materials discovery. We investigate an approach for extracting synthesis protocols for reticular materials from scientific literature using large language models (LLMs). To that end, we introduce a Knowledge Extraction Pipeline (KEP) that automates LLM-assisted paragraph classification and information extraction. By applying prompt engineering with in-context learning (ICL) to a set of open-source LLMs, we demonstrate that LLMs can retrieve chemical information from PDF documents without fine-tuning or training and with a reduced risk of hallucination. Comparing five open-source LLM families on both the paragraph classification and information extraction tasks, we observe excellent model performance even when only a few example paragraphs are included in the ICL prompts. The results show the potential of the KEP approach for reducing human annotation and data curation effort in automated scientific knowledge extraction.