The discovery of new materials has a documented history of propelling human progress for centuries. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, which gives rise to several machine learning challenges. Here, we discuss, quantify, and document these outstanding challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE for the materials knowledge base.