Data play an increasingly important role in the exploration of cutting-edge materials, and many datasets have been generated either by hand or through automated approaches. However, the materials science field struggles to use this abundance of data effectively, especially in applied disciplines where materials are evaluated by device performance rather than by their properties. This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science. We accomplished this task by fine-tuning GPT-3 on an existing perovskite solar cell FAIR (Findable, Accessible, Interoperable, Reusable) dataset, achieving an F1-score of 91.8%, and extended the dataset with data published since its release. The produced data are formatted and normalized, so they can be used directly as input to subsequent data analysis. This allows materials scientists to develop models by selecting high-quality review articles within their domain. Additionally, we designed experiments that use large language models (LLMs) to predict the electrical performance of solar cells and to design materials or devices with targeted parameters. Our results show performance comparable to traditional machine learning methods without feature selection, highlighting the potential of LLMs to acquire scientific knowledge and design new materials as materials scientists do.