Protein-ligand structure prediction is an essential task in drug discovery: it predicts the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, experimental validation of docking conformations remains costly, and the resulting scarcity of training data raises concerns about the generalizability of these deep learning-based methods. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning on a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved generating 100 million docking conformations for protein-ligand pairs, which consumed roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated in the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating high precision and robust transferability in predicting binding conformations. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating that performance improves consistently as the number of model parameters and the volume of pre-training data increase. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility: it performs strongly on both cross-docking and structure-based virtual screening benchmarks.