The discovery of new materials has a well-documented history of propelling human progress over centuries. The behaviour of a material is a function of its composition, structure, and properties, which in turn depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, which gives rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address these challenges in a coherent fashion, providing a fillip to IE towards a materials knowledge base.