Recent research in representation learning uses large databases of proteins and molecules to learn drug and protein structure through unsupervised learning. These pre-trained representations have been shown to significantly improve accuracy on downstream tasks, such as predicting the binding affinity between drugs and target proteins. In this study, we demonstrate that incorporating knowledge graphs from diverse sources and modalities into protein-sequence or SMILES representations further enriches them and achieves state-of-the-art results on established benchmark datasets. We provide preprocessed and integrated data from 7 public sources, comprising over 30M triples. We also release the models pre-trained on this data, together with their results on three widely used drug-target binding affinity prediction datasets from the Therapeutics Data Commons (TDC) benchmarks, and the source code for training models on these benchmark datasets. Our objective in releasing these pre-trained models, along with clean pretraining data and benchmark results, is to encourage research in knowledge-enhanced representation learning.
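To make the notion of knowledge-graph triples that carry sequence or SMILES attributes concrete, the following is a minimal sketch, not the released data format or the authors' model. The entity identifiers, relation names (`has_smiles`, `has_sequence`, `targets`, `involved_in`), the placeholder sequence, and both helper functions are assumptions introduced purely for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical toy triples; all identifiers and relation names are
# illustrative assumptions, not entries from the released data.
TRIPLES = [
    Triple("drug:D1", "has_smiles", "CCO"),            # toy SMILES string
    Triple("drug:D1", "targets", "protein:P1"),
    Triple("protein:P1", "has_sequence", "MKTAYIAK"),  # placeholder sequence
    Triple("protein:P1", "involved_in", "pathway:X"),
]


def build_index(triples):
    """Group outgoing (relation, tail) pairs by head entity."""
    index = defaultdict(list)
    for t in triples:
        index[t.head].append((t.relation, t.tail))
    return index


def entity_context(entity, index,
                   modality_relations=("has_smiles", "has_sequence")):
    """Split an entity's outgoing edges into modality attributes
    (sequence or SMILES strings) and links to other graph entities."""
    attributes, neighbors = {}, []
    for relation, tail in index.get(entity, []):
        if relation in modality_relations:
            attributes[relation] = tail
        else:
            neighbors.append((relation, tail))
    return attributes, neighbors


index = build_index(TRIPLES)
attrs, nbrs = entity_context("drug:D1", index)
print(attrs)  # modality attributes of the drug entity
print(nbrs)   # its links to other entities in the graph
```

In a sketch like this, the modality attributes would be passed to a sequence or SMILES encoder, while the neighbor links would supply the graph context that enriches the resulting representation.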