Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, no benchmark dataset in the materials domain evaluates how well these language models understand key concepts. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student who has completed an undergraduate degree. We classify these questions by their structure and by materials science subdomain. Further, we evaluate the performance of GPT-3.5 and GPT-4 on these questions via zero-shot and chain-of-thought prompting. GPT-4 gives the best performance (~62% accuracy), outperforming GPT-3.5. Interestingly, in contrast to the general observation, chain-of-thought prompting yields no significant improvement in accuracy. To evaluate the limitations, we performed an error analysis, which revealed that conceptual errors (~64%), rather than computational errors (~36%), are the major contributor to the reduced performance of LLMs. We hope that the dataset and analysis in this work will promote further research on better materials science domain-specific LLMs and on strategies for information extraction.
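The evaluation described above compares zero-shot and chain-of-thought prompting and reports accuracy. A minimal sketch of that comparison might look like the following. The `Question` type, `build_prompt`, `accuracy`, the prompt wording, and the exact-match scoring rule are all illustrative assumptions; the abstract does not specify the prompts or the grading procedure actually used.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical record for one benchmark question."""
    text: str
    answer: str  # gold answer as a string, e.g. an option letter or a number


def build_prompt(q: Question, chain_of_thought: bool = False) -> str:
    """Build a zero-shot prompt, optionally with a chain-of-thought cue.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    prompt = f"Question: {q.text}\nAnswer:"
    if chain_of_thought:
        # A commonly used zero-shot chain-of-thought cue.
        prompt += " Let's think step by step."
    return prompt


def accuracy(questions: list[Question], predictions: list[str]) -> float:
    """Fraction of predictions matching the gold answer.

    Uses case-insensitive exact match after trimming whitespace,
    which is an assumed scoring rule.
    """
    if not questions:
        return 0.0
    correct = sum(
        p.strip().lower() == q.answer.strip().lower()
        for q, p in zip(questions, predictions)
    )
    return correct / len(questions)
```

In such a setup, the same question list would be run once with `chain_of_thought=False` and once with `chain_of_thought=True`, and the two accuracy values compared.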