The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures depicting molecular structures. However, automatically recovering the exact chemical structure from such figures is a formidable challenge because of the amount of detailed information, the diversity of drawing styles, and the scarcity of real training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and place them in a graph, which yields a natural graph representation of the molecule. Last, we classify the atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline that produces diverse and realistic molecule images. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.
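The graph construction step can be illustrated with a minimal sketch. It is not the MolGrapher implementation: the `Supergraph` container, the `build_supergraph` function, and the `max_bond_length` distance rule for proposing candidate bonds are all assumptions made for illustration. The abstract only states that candidate atoms and bonds both become nodes of one graph.

```python
from dataclasses import dataclass, field
from itertools import combinations
from math import dist


@dataclass
class Supergraph:
    """Hypothetical container: atoms and candidate bonds are both nodes."""
    kinds: list = field(default_factory=list)      # "atom" or "bond" per node
    positions: list = field(default_factory=list)  # (x, y) in pixels per node
    edges: list = field(default_factory=list)      # undirected (i, j) index pairs


def build_supergraph(keypoints, max_bond_length):
    """Build a graph from detected atom keypoints.

    ASSUMPTION: a candidate bond is proposed for every pair of keypoints
    no farther apart than max_bond_length. The real proposal rule may differ.
    """
    graph = Supergraph()

    # Every detected keypoint becomes an atom node.
    for point in keypoints:
        graph.kinds.append("atom")
        graph.positions.append(tuple(point))

    # Every sufficiently close pair of atoms yields one bond node,
    # placed at the midpoint and linked to both of its atoms.
    for i, j in combinations(range(len(keypoints)), 2):
        if dist(keypoints[i], keypoints[j]) <= max_bond_length:
            bond_index = len(graph.kinds)
            (xi, yi), (xj, yj) = keypoints[i], keypoints[j]
            graph.kinds.append("bond")
            graph.positions.append(((xi + xj) / 2, (yi + yj) / 2))
            graph.edges.append((i, bond_index))
            graph.edges.append((bond_index, j))

    return graph


# Three collinear keypoints, 10 pixels apart.
example = build_supergraph([(0, 0), (10, 0), (20, 0)], max_bond_length=12)
print(example.kinds)
```

In this sketch the two outer keypoints are 20 pixels apart, so no bond node is proposed between them. A node classifier, such as the Graph Neural Network mentioned above, would then assign a label to each atom node and each bond node; that step is not shown here.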