Modern research increasingly relies on automated methods to assist researchers. One example is Optical Chemical Structure Recognition (OCSR), which helps chemists retrieve information about chemicals from large collections of documents. Markush structures are chemical structures that OCSR cannot parse correctly, and they cause errors. This research proposes and tests a novel method for classifying Markush structures. Within this method, fixed-feature extraction was compared with end-to-end learning using a CNN. The end-to-end method performed significantly better, achieving a Macro F1 of 0.928 (SD 0.035) compared to 0.701 (SD 0.052) for the fixed-feature method. Because of the nature of the experiment, these figures are a lower bound and can be improved further. These results suggest that the proposed method can filter out Markush structures effectively and accurately. When integrated into OCSR pipelines, the method can improve their performance and their usefulness to other researchers.
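For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The sketch below illustrates that definition only; it is not the evaluation code used in this work. The class names `markush` and `plain`, the function name `macro_f1`, and the toy labels are all assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical, not data from the study).
y_true = ["markush", "markush", "plain", "plain", "plain", "plain"]
y_pred = ["markush", "plain", "plain", "plain", "plain", "markush"]

# F1 is 0.5 for "markush" and 0.75 for "plain", so the macro average is 0.625.
print(macro_f1(y_true, y_pred))
```

A reported value such as 0.928 (SD 0.035) would then presumably be the mean and standard deviation of this score over repeated runs or folds.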