Materials language processing (MLP) is a key facilitator of materials science research, as it enables the extraction of structured information from the vast materials science literature. Prior works proposed high-performance MLP models for text classification, named entity recognition (NER), and extractive question answering (QA), but these require complex model architectures, exhaustive fine-tuning, and large human-labelled datasets. In this study, we develop generative pretrained transformer (GPT)-enabled pipelines in which the complex architectures of prior MLP models are replaced with strategically designed prompt engineering. First, we develop a GPT-enabled document classification method for screening relevant documents, achieving accuracy and reliability comparable to prior models with only a small dataset. Second, for the NER task, we design entity-centric prompts, and few-shot learning with these prompts improved performance on most entities in three open datasets. Finally, we develop a GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations. Our findings confirm the potential of GPT-enabled MLP models as well as their value in terms of reliability and practicability, and our scientific methods and systematic approach are applicable to any materials science domain to accelerate information extraction from the scientific literature.
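The prompt-based approach summarized above can be illustrated with a minimal sketch of how few-shot prompts for document classification and entity-centric NER might be assembled. This is not the authors' implementation: the function names, the `Example` structure, the prompt wording, and the toy sentences are all assumptions made for illustration, and the resulting strings would still need to be sent to a GPT model through whatever client is in use.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labelled example for a few-shot prompt (hypothetical structure)."""
    text: str
    label: str


def build_classification_prompt(task, labels, examples, query):
    """Assemble a few-shot prompt: instruction, labelled examples, then the query."""
    lines = [task, "Answer with one of: " + ", ".join(labels) + ".", ""]
    for ex in examples:
        if ex.label not in labels:
            raise ValueError(f"unknown label: {ex.label!r}")
        lines += [f"Text: {ex.text}", f"Label: {ex.label}", ""]
    # The prompt ends with an open "Label:" slot for the model to complete.
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


def build_entity_prompt(entity_type, examples, query):
    """Assemble an entity-centric prompt that asks about a single entity type.

    `examples` is a list of (text, mentions) pairs, where `mentions` is the
    list of spans of `entity_type` found in `text`.
    """
    lines = [
        f"List every {entity_type} mentioned in the text, separated by '; '. "
        "Write 'none' if there are none.",
        "",
    ]
    for text, mentions in examples:
        answer = "; ".join(mentions) if mentions else "none"
        lines += [f"Text: {text}", f"{entity_type}: {answer}", ""]
    # One prompt per entity type; the model completes the final slot.
    lines += [f"Text: {query}", f"{entity_type}:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy sentences invented for illustration only.
    print(build_classification_prompt(
        "Decide whether the abstract is about battery materials.",
        ["relevant", "irrelevant"],
        [Example("LiFePO4 cathodes show stable cycling.", "relevant"),
         Example("We study protein folding kinetics.", "irrelevant")],
        "A solid electrolyte with high ionic conductivity is reported.",
    ))
    print()
    print(build_entity_prompt(
        "Material",
        [("TiO2 films were annealed at 450 C.", ["TiO2"])],
        "ZnO nanowires were grown on Si substrates.",
    ))
```

Building one prompt per entity type keeps each request focused on a single category, which is one plausible reading of "entity-centric"; the number and choice of few-shot examples is left to the caller.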