Molecule representation learning underpins diverse downstream applications, such as understanding and predicting molecular properties and side effects. In this paper, we recognize the two-level structure of an individual molecule: it has an intrinsic graph structure and is also a node in a large molecule knowledge graph. We present GODE, a new approach that integrates graph representations of individual molecules with multi-domain biomedical data from knowledge graphs. By pre-training two graph neural networks (GNNs) on different graph structures and combining them with contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property prediction by harnessing both chemical and biological information. Fine-tuned on 11 chemical property tasks, our model surpasses benchmarks, achieving average ROC-AUC improvements of 14.5%, 9.8%, and 7.3% on the BBBP, SIDER, and Tox21 datasets, respectively. In regression tasks on the ESOL and QM7 datasets, we achieve average improvements of 21.0% and 29.6% in RMSE and MAE, respectively, setting a new field benchmark.
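The abstract does not specify the contrastive objective used to align the two pre-trained GNNs. As a minimal sketch, assuming a symmetric InfoNCE-style loss (a common choice, not confirmed as GODE's), the alignment between molecule-graph embeddings and knowledge-graph-subgraph embeddings might look like the following. The function names, the temperature value, and the random toy embeddings are all hypothetical and stand in for the outputs of the two GNNs.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match positives[i];
    every other row in the batch acts as a negative."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_loss(mol_emb, kg_emb, temperature=0.1):
    """Average the molecule-to-KG and KG-to-molecule directions."""
    return 0.5 * (info_nce(mol_emb, kg_emb, temperature)
                  + info_nce(kg_emb, mol_emb, temperature))


# Toy data (assumption): 8 molecules with 16-dimensional embeddings.
random.seed(0)
mol = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
kg_aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in mol]
kg_random = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = symmetric_loss(mol, kg_aligned)
random_loss = symmetric_loss(mol, kg_random)
```

In this sketch the loss is low when each molecule embedding is closest to its own knowledge-graph embedding and high when the pairing is uninformative, which is the behavior a contrastive fusion step relies on.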