We have developed MatGD (Material Graph Digitizer), a tool for digitizing data lines from scientific graphs. The algorithm behind the tool consists of four steps: (1) identifying graphs within subfigures, (2) separating the axes from the data section, (3) discerning the data lines by eliminating irrelevant graph objects and matching the lines with the legend, and (4) extracting and saving the data. We mined 501,045 figures from 62,534 papers in the areas of batteries, catalysis, and metal-organic frameworks (MOFs). The tool achieved over 99% accuracy in legend marker and text detection. Its data line separation capability reached 66%, which is much higher than that of existing figure mining tools. We believe that this tool will be integral to collecting both past and future data from publications, and that these data can be used to train machine learning models that improve materials property prediction and accelerate the discovery of new materials.
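To make steps (3) and (4) concrete, the following is a minimal sketch of how a single data line might be digitized once its legend color is known. It is not MatGD's implementation: the color-tolerance matching, the two-point linear axis calibration, and the names `pixel_to_data` and `extract_line` are all assumptions introduced here for illustration, and real figures would also require handling of overlapping lines, markers, and non-linear axes.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
# Calibration: (pixel_a, pixel_b, value_a, value_b) for two known axis ticks.
Calibration = Tuple[float, float, float, float]


def pixel_to_data(px: float, p0: float, p1: float, v0: float, v1: float) -> float:
    """Map a pixel coordinate to a data value, assuming a linear axis."""
    return v0 + (px - p0) * (v1 - v0) / (p1 - p0)


def extract_line(
    image: Sequence[Sequence[Color]],
    target: Color,
    tol: int,
    x_cal: Calibration,
    y_cal: Calibration,
) -> List[Tuple[float, float]]:
    """Collect pixels close to the legend color and convert them to data points.

    Hypothetical helper: for each pixel column, average the rows whose color is
    within `tol` of `target` on every channel, then apply the axis calibration.
    """
    points = []
    for col in range(len(image[0])):
        rows = [
            r
            for r in range(len(image))
            if all(abs(a - b) <= tol for a, b in zip(image[r][col], target))
        ]
        if not rows:
            continue  # no pixel of this line in the column
        row = sum(rows) / len(rows)
        points.append((pixel_to_data(col, *x_cal), pixel_to_data(row, *y_cal)))
    return points


# Toy 5x5 "plot": white background with a red line rising from bottom-left.
WHITE, RED = (255, 255, 255), (255, 0, 0)
image = [[WHITE] * 5 for _ in range(5)]
for col in range(5):
    image[4 - col][col] = RED  # row 0 is the top of the image

# Assumed calibration: x pixels 0..4 map to 0..4; y pixel row 4 maps to 0, row 0 to 8.
points = extract_line(image, RED, tol=10, x_cal=(0, 4, 0.0, 4.0), y_cal=(4, 0, 0.0, 8.0))
print(points)
```

In this toy case the image rows are numbered from the top, so the y calibration maps the larger pixel index to the smaller data value. The extracted points can then be written to a file such as a CSV to complete step (4).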