Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models

Protein-ligand structure prediction is an essential task in drug discovery: it predicts the binding interactions between small molecules (ligands) and target proteins (receptors). Although conventional physics-based docking tools are widely used, their accuracy is compromised by limited conformational sampling and imprecise scoring functions. Recent advances have incorporated deep learning techniques to improve the accuracy of structure prediction. Nevertheless, experimental validation of docking conformations remains costly, and the resulting scarcity of training data raises concerns about the generalizability of these deep learning-based methods. In this work, we show that outstanding performance can be achieved by pre-training a geometry-aware SE(3)-equivariant neural network on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning it on a limited set of experimentally validated receptor-ligand complexes. Generating the pre-training data, 100 million docking conformations, consumed roughly 1 million CPU core-days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated in the physics-based docking tools during the pre-training phase. Benchmarked against both physics-based and deep learning-based baselines, HelixDock outperforms its closest competitor by over 40% in terms of RMSD. HelixDock also performs better on a more challenging dataset, highlighting its robustness. Moreover, our investigation reveals the scaling laws governing pre-trained structure prediction models: performance improves consistently as model parameters and pre-training data increase. This study demonstrates the strategic advantage of leveraging a vast and varied repository of generated data to advance AI-driven drug discovery.
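
For readers unfamiliar with the RMSD metric mentioned above, the following is a minimal sketch of how the root-mean-square deviation between a predicted and a reference ligand pose is commonly computed. It is an illustration under stated assumptions, not the evaluation code used for HelixDock: the function name `ligand_rmsd` and the toy coordinates are hypothetical, both poses are assumed to list the same atoms in the same order in the receptor's coordinate frame, and no superposition or symmetry correction is applied.

```python
import math

def ligand_rmsd(pred, ref):
    """RMSD (in angstroms) between two poses given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and a shared receptor frame; the poses
    are compared in place, without alignment or symmetry correction.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    # Sum of squared coordinate differences over all atoms and all three axes.
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))

# Hypothetical three-atom ligand: the predicted pose is the reference
# pose shifted by 1 angstrom along the x axis.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(x + 1.0, y, z) for x, y, z in reference]
rmsd = ligand_rmsd(predicted, reference)
```

Because every atom in the toy example is displaced by the same 1 angstrom, the resulting RMSD equals that displacement. Lower values indicate a predicted pose closer to the reference.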
