Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs contain drawing commands that give the explicit locations and shapes of characters, lines, and polygons. We present a new parser that takes these born-digital PDF primitives as input. The parser is fast and accurate, and requires no GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images, and then train a new multi-task neural network to recognize molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, and with a novel evaluation protocol that compares molecular graphs directly; this protocol supports automatic error compilation and reveals errors that SMILES-based evaluation misses.
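To illustrate what comparing molecular graphs directly can look like, the following is a minimal sketch and not the evaluation protocol itself. It assumes that predicted and reference graphs already share node identifiers (for example, because atoms were matched by their location in the diagram); the function `compare_graphs`, the helper `bond`, and the dictionary-based graph encoding are all hypothetical names chosen for this example.

```python
def compare_graphs(gold_atoms, gold_bonds, pred_atoms, pred_bonds):
    """List atom and bond disagreements between two molecular graphs.

    Assumption (hypothetical encoding): atoms map node id -> element label,
    bonds map frozenset({a, b}) -> bond order, and both graphs use the same
    node ids. A value of None marks a missing atom or bond.
    """
    errors = []
    # Atom-level errors: wrong label, missing atom, or extra atom.
    for node in sorted(set(gold_atoms) | set(pred_atoms)):
        gold, pred = gold_atoms.get(node), pred_atoms.get(node)
        if gold != pred:
            errors.append(("atom", node, gold, pred))
    # Bond-level errors: wrong order, missing bond, or extra bond.
    for edge in sorted(set(gold_bonds) | set(pred_bonds), key=sorted):
        gold, pred = gold_bonds.get(edge), pred_bonds.get(edge)
        if gold != pred:
            errors.append(("bond", tuple(sorted(edge)), gold, pred))
    return errors


def bond(a, b):
    """Undirected bond key between nodes a and b."""
    return frozenset((a, b))


# Reference: ethanol skeleton C-C-O. Prediction: C-C=O (wrong bond order).
gold_atoms = {0: "C", 1: "C", 2: "O"}
gold_bonds = {bond(0, 1): 1, bond(1, 2): 1}
pred_atoms = {0: "C", 1: "C", 2: "O"}
pred_bonds = {bond(0, 1): 1, bond(1, 2): 2}

errors = compare_graphs(gold_atoms, gold_bonds, pred_atoms, pred_bonds)
print(errors)
```

Unlike a pass/fail comparison of two SMILES strings, a comparison of this kind points to the specific atom or bond that differs, which is what makes automatic compilation of errors possible.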