Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

This study assesses the ability of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo to extract structured information from scientific documents in materials science. We focus on two critical information extraction tasks: (i) named entity recognition (NER) of studied materials and physical properties and (ii) relation extraction (RE) between these entities. Because datasets within Materials Informatics (MI) are scarce, we evaluate on two corpora: SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of the LLMs on these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. For RE, however, a GPT-3.5-Turbo fine-tuned with the appropriate strategy outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relation extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks that require extracting complex domain-specific entities such as materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.
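To illustrate why standardising chemical formulas matters when comparing extracted material names, the following is a minimal Python sketch, not the methodology used in this study. It assumes flat formulas without parentheses, dopant variables, or charges, and the helper names `parse_formula` and `same_material` are hypothetical. Two strings are treated as the same material when they yield the same element-to-amount mapping, regardless of element order or spacing.

```python
import re
from fractions import Fraction

# One token = an element symbol followed by an optional (possibly decimal) amount.
# Assumption: flat formulas only, e.g. "MgB2" or "La1.85Sr0.15CuO4".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula: str) -> dict:
    """Return a mapping element -> amount for a flat chemical formula."""
    cleaned = formula.replace(" ", "")
    counts = {}
    pos = 0
    for match in _TOKEN.finditer(cleaned):
        if match.start() != pos:
            # Some characters between tokens could not be parsed.
            raise ValueError(f"cannot parse formula: {formula!r}")
        element, amount = match.group(1), match.group(2)
        # Fractions avoid floating-point noise when comparing amounts like 0.15.
        counts[element] = counts.get(element, Fraction(0)) + Fraction(amount or "1")
        pos = match.end()
    if pos != len(cleaned) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


def same_material(predicted: str, expected: str) -> bool:
    """Compare two formulas after normalisation (order and spacing ignored)."""
    return parse_formula(predicted) == parse_formula(expected)


if __name__ == "__main__":
    print(same_material("MgB2", "B2Mg"))
    print(same_material("La1.85Sr0.15CuO4", "CuO4 Sr0.15 La1.85"))
    print(same_material("MgB2", "MgB4"))
```

A plain string comparison would count "B2Mg" as a wrong prediction for "MgB2"; comparing normalised compositions avoids penalising such surface differences. Real material expressions (substitution variables, nested groups, material classes) need considerably more handling than this sketch provides.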
