Accelerated materials language processing enabled by GPT

Materials language processing (MLP) is one of the key facilitators of materials science research, as it enables the extraction of structured information from the massive materials science literature. Prior works proposed high-performance MLP models for text classification, named entity recognition (NER), and extractive question answering (QA), but these require complex model architectures, exhaustive fine-tuning, and a large number of human-labelled datasets. In this study, we develop generative pretrained transformer (GPT)-enabled pipelines in which the complex architectures of prior MLP models are replaced with strategic prompt engineering. First, we develop a GPT-enabled document classification method for screening relevant documents, achieving accuracy and reliability comparable to prior models with only a small dataset. Second, for the NER task, we design entity-centric prompts, and few-shot learning with them improved performance on most entities in three open datasets. Finally, we develop a GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations. Our findings confirm the potential of GPT-enabled MLP models as well as their value in terms of reliability and practicability, and our scientific methods and systematic approach are applicable to any materials science domain to accelerate information extraction from the scientific literature.
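
The abstract does not give the wording of the entity-centric prompts, so the following is only a minimal sketch of the general idea, assuming one prompt per entity type with a handful of labelled demonstrations placed before the query sentence. The names `Example`, `build_entity_prompt`, and `parse_entities`, the instruction text, and the semicolon-separated answer format are all hypothetical and are not taken from the paper. No model call is made; the sketch only assembles the prompt and parses a completion string.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labelled demonstration for the few-shot prompt (hypothetical type)."""
    sentence: str
    entities: list[str]


def build_entity_prompt(entity_type: str, examples: list[Example], query: str) -> str:
    """Assemble a few-shot prompt that asks for a single entity type.

    The instruction wording and answer format are assumptions for illustration.
    """
    lines = [
        f"Extract every {entity_type} mentioned in the sentence.",
        "Answer with a semicolon-separated list, or 'none'.",
        "",
    ]
    # Each demonstration shows the sentence followed by its gold answer.
    for ex in examples:
        answer = "; ".join(ex.entities) if ex.entities else "none"
        lines += [f"Sentence: {ex.sentence}", f"{entity_type}: {answer}", ""]
    # The query ends with an open answer slot for the model to complete.
    lines += [f"Sentence: {query}", f"{entity_type}:"]
    return "\n".join(lines)


def parse_entities(completion: str) -> list[str]:
    """Turn a raw completion string back into a list of entity strings."""
    text = completion.strip()
    if not text or text.lower() == "none":
        return []
    return [part.strip() for part in text.split(";") if part.strip()]
```

Under these assumptions, the same builder would be called once per entity type, and the returned string would be sent to whichever GPT endpoint is in use; the completion is then passed to `parse_entities`.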

