Representing molecules effectively is a long-standing challenge in molecular property prediction and drug discovery. This paper addresses the problem by incorporating chemical domain knowledge, specifically knowledge about chemical reactions, into the learning of molecular representations. However, chemical reactions and molecules are inherently different modalities, which makes transferring knowledge between them a significant challenge. To this end, we introduce MolKD, a novel method that Distills cross-modal Knowledge in chemical reactions to assist Molecular property prediction. Specifically, the reaction-to-molecule distillation model within MolKD transfers knowledge from a teacher network pre-trained on one modality (reactions) into a student network trained on another (molecules). Moreover, when pre-training on reactions, MolKD incorporates reaction yields to measure the transformation efficiency of each reactant-product pair, which helps it learn effective molecular representations. Extensive experiments show that MolKD significantly outperforms various competitive baseline models, e.g., an absolute AUC-ROC gain of 2.1% on Tox21. Further analysis shows that the pre-trained molecular representations in MolKD capture chemically reasonable molecular similarities, which enables robust and interpretable molecular property prediction.
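To make the reaction-to-molecule distillation idea concrete, the following Python sketch combines a property-prediction loss on the student with a term that pulls the student's molecule embeddings toward those of a frozen teacher. The abstract does not specify MolKD's actual objective, so everything here is an assumption made for illustration: the mean-squared-error form of the distillation term, the binary cross-entropy task loss, the weight `alpha`, and all function names are hypothetical.

```python
import math


def bce_loss(logits, labels):
    """Mean binary cross-entropy over a batch, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def mse_distill_loss(student_emb, teacher_emb):
    """Mean squared error between student and frozen-teacher embeddings.

    Assumption: both networks emit embeddings of the same dimension for
    the same molecule; MolKD's real alignment scheme may differ.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_emb, teacher_emb):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def combined_loss(logits, labels, student_emb, teacher_emb, alpha=0.5):
    """Task loss plus an alpha-weighted distillation term (alpha is hypothetical)."""
    task = bce_loss(logits, labels)
    distill = mse_distill_loss(student_emb, teacher_emb)
    return task + alpha * distill


# Toy batch of two molecules with 2-dimensional embeddings (made-up numbers).
student = [[0.1, 0.4], [0.3, -0.2]]
teacher = [[0.1, 0.5], [0.2, -0.2]]
logits = [2.0, -1.0]   # student predictions for one binary property
labels = [1, 0]

print(combined_loss(logits, labels, student, teacher))
```

In a setup like this, only the student would be updated during training; the teacher's embeddings act as fixed targets carrying what was learned from reactions.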