The automated extraction of chemical structures and their corresponding bioactivity data is essential for accelerating drug discovery and enabling data-driven research. Current optical chemical structure recognition tools cannot autonomously link molecular structures with their bioactivity profiles, which creates a significant bottleneck in structure-activity relationship analysis. To address this, we present BioChemInsight, an open-source pipeline that integrates DECIMER Segmentation with MolNexTR for chemical structure recognition, GLM-4.5V for compound identifier association, and PaddleOCR combined with GLM-4.6 for bioactivity extraction and unit normalization. We evaluated BioChemInsight on 181 patents covering 15 therapeutic targets. The system achieved an average accuracy above 90% on each of three key tasks: chemical structure recognition, bioactivity data extraction, and compound identifier association. Our analysis indicates that the chemical space covered by patents is largely complementary to that contained in the established public database ChEMBL. Consequently, by enabling systematic patent mining, BioChemInsight provides access to chemical information underrepresented in ChEMBL. This capability expands the landscape of explorable compound-target interactions, enriches the data foundation for quantitative structure-activity relationship modeling and targeted screening, and reduces data preprocessing time from weeks to hours. BioChemInsight is available at https://github.com/dahuilangda/BioChemInsight.
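To make the unit normalization step concrete, the following is a minimal sketch of how extracted potency values reported in mixed concentration units might be brought onto a common nanomolar scale. It is not taken from the BioChemInsight code base: the function names `to_nanomolar` and `to_pic50`, the conversion table, and the choice of nM as the target unit are all assumptions made for illustration.

```python
import math

# Hypothetical conversion table (not from BioChemInsight): factor that
# converts a value in the given unit to nanomolar (nM).
_TO_NM = {
    "pM": 1e-3,
    "nM": 1.0,
    "uM": 1e3,
    "µM": 1e3,  # micro sign (U+00B5)
    "μM": 1e3,  # Greek small letter mu (U+03BC)
    "mM": 1e6,
    "M": 1e9,
}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM; raise ValueError for unknown units."""
    key = unit.strip()
    if key not in _TO_NM:
        raise ValueError(f"unsupported concentration unit: {unit!r}")
    return value * _TO_NM[key]


def to_pic50(value: float, unit: str) -> float:
    """Return -log10 of the concentration in mol/L (for example a pIC50)."""
    nanomolar = to_nanomolar(value, unit)
    if nanomolar <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(M) = 9 - log10(nM).
    return 9.0 - math.log10(nanomolar)
```

Matching unit strings case-sensitively is deliberate in this sketch, since `mM` and `M` differ only by prefix and case; a production pipeline would also need to handle qualifiers such as `<` or `>` and non-molar units, which are left out here.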