HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative

AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These pipelines rely mainly on Multiple Sequence Alignments (MSAs) as input, learning co-evolution information from homologous sequences. However, searching protein databases for MSAs is time-consuming, usually taking dozens of minutes. We therefore explore the limits of fast protein structure prediction using only the primary sequences of proteins. We propose HelixFold-Single, which combines a large-scale protein language model with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trains a large-scale protein language model (PLM) on thousands of millions of primary sequences using the self-supervised learning paradigm; the PLM serves as an alternative to MSAs for learning co-evolution information. Then, by combining the pre-trained PLM with the essential components of AlphaFold2, we obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single takes much less time than mainstream protein structure prediction pipelines, demonstrating its potential for tasks that require many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
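
The data flow described above (primary sequence, then PLM, then AlphaFold2-derived geometric components, then 3D coordinates) can be illustrated with a toy sketch in plain Python. This is not the HelixFold-Single implementation: every function name here (`toy_plm`, `adaptor`, `toy_geometry_module`, `predict_structure`) is hypothetical, the weights are random rather than pre-trained, and the sketch assumes, purely for illustration, that per-residue embeddings seed a single representation while attention maps seed a pair representation. It only shows that no MSA search appears anywhere in the path from sequence to coordinates.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_plm(sequence, dim=8, heads=2, seed=0):
    """Hypothetical stand-in for a pre-trained protein language model.

    Returns per-residue embeddings [L][dim] and attention maps [heads][L][L].
    Weights are random here; a real PLM is trained on unlabeled sequences.
    """
    rng = random.Random(seed)
    table = {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    emb = [table[aa] for aa in sequence]
    width = dim // heads
    attn = []
    for h in range(heads):
        # Each head attends over its own slice of the embedding.
        chunk = [vec[h * width:(h + 1) * width] for vec in emb]
        head = []
        for q in chunk:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(width) for k in chunk]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            total = sum(exps)
            head.append([e / total for e in exps])
        attn.append(head)
    return emb, attn


def adaptor(emb, attn):
    """Assumed adaptor: embeddings -> single repr, stacked attention -> pair repr."""
    length = len(emb)
    single = [list(vec) for vec in emb]
    pair = [[[head[i][j] for head in attn] for j in range(length)] for i in range(length)]
    return single, pair


def toy_geometry_module(single, pair, seed=1):
    """Placeholder for the geometric components; emits one 3D point per residue."""
    rng = random.Random(seed)
    dim = len(single[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(dim)]
    coords = []
    for i in range(len(single)):
        # Mix residue features using the head-averaged pair value for (i, j).
        weights = [sum(pair[i][j]) / len(pair[i][j]) for j in range(len(single))]
        mixed = [sum(w * single[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        coords.append([sum(mixed[d] * proj[d][k] for d in range(dim)) for k in range(3)])
    return coords


def predict_structure(sequence):
    """Sequence in, coordinates out; the only input is the primary sequence."""
    emb, attn = toy_plm(sequence)
    single, pair = adaptor(emb, attn)
    return toy_geometry_module(single, pair)


coords = predict_structure("MKTAYIAKQR")
print(len(coords), len(coords[0]))  # 10 3
```

The coordinates produced by this sketch are meaningless; only the shapes and the order of the stages are of interest. In a real system each stage would be a trained neural network and the whole chain would be differentiable end to end.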
