Deep learning methods have greatly advanced protein folding research, with AlphaFold2 (AF2) demonstrating exceptional performance and atomic-level precision. Because co-evolution is integral to protein structure prediction, AF2's accuracy depends strongly on the depth of the multiple sequence alignment (MSA), and building a deep MSA requires an extensive search of large protein databases for similar sequences. However, not all protein sequences have abundant homologous sequences, so AF2's performance can degrade on such queries, at times failing to produce meaningful results. To address this, we introduce MSA-Augmenter, a novel generative language model that leverages protein-specific attention mechanisms and large-scale MSAs to generate useful, novel protein sequences not currently found in databases. These sequences supplement shallow MSAs, enhancing the accuracy of structural property predictions. Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain the co-evolutionary information of low-quality MSAs, thereby improving protein structure prediction quality over an already strong AF2 baseline.