Accurate and comprehensive materials databases extracted from research papers are critical for materials science and engineering, but they require significant human effort to develop. In this paper, we present a simple method for extracting materials data from the full texts of research papers, suitable for quickly developing modest-sized databases. The method requires minimal to no coding, prior knowledge about the extracted property, or model training, and provides high recall and almost perfect precision in the resulting database. The method is fully automated except for one human-assisted step, which typically requires just a few hours of human labor. The method builds on natural language processing and large general language models but can work with almost any such model. The language models GPT-3/3.5, BART, and DeBERTaV3 are evaluated here for comparison. We provide a detailed analysis of the method's performance in extracting bulk modulus data, obtaining up to 90% precision at 96% recall, depending on the amount of human effort involved. We then demonstrate the method's broader effectiveness by developing a database of critical cooling rates for metallic glasses.
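For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how precision and recall are computed for an extracted database. The function name and the counts are hypothetical and chosen only to reproduce the 90%/96% figures; they are not taken from the paper's evaluation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of extracted records.

    precision: fraction of extracted records that are correct.
    recall:    fraction of records present in the papers that were extracted.
    """
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts (an assumption for illustration, not data from the paper):
# 72 correct records extracted, 8 incorrect records extracted, 3 records missed.
precision, recall = precision_recall(
    true_positives=72, false_positives=8, false_negatives=3
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this framing, additional human review mainly removes false positives, which raises precision, while recall depends on how many relevant records the automated steps surface in the first place.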