Proteins are macromolecules responsible for essential functions in almost all living organisms. Designing reasonable proteins with desired functions is crucial. A protein's sequence and structure are strongly correlated, and together they determine its function. In this paper, we propose NAEPro, a model that jointly designs protein sequence and structure based on automatically detected functional sites. NAEPro is powered by an interleaving network of attention and equivariant layers, which captures global correlation across the whole sequence and local influence from the nearest amino acids in three-dimensional (3D) space. Such an architecture facilitates effective yet economical message passing at two levels. We evaluate our model and several strong baselines on two protein datasets, $\beta$-lactamase and myoglobin. Experimental results show that our model consistently achieves the highest amino acid recovery rate and TM-score and the lowest RMSD among all competitors. These findings demonstrate the capability of our model to design protein sequences and structures that closely resemble their natural counterparts. Furthermore, in-depth analysis confirms our model's ability to generate highly effective proteins capable of binding to their target metallocofactors. We provide code, data, and models on GitHub.
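To make two of the reported metrics concrete, the following is a minimal sketch of how amino acid recovery rate and RMSD are commonly computed. It is not taken from the NAEPro code base: the function names `amino_acid_recovery` and `rmsd` are hypothetical, the sequences are assumed to be aligned position by position, and the coordinates are assumed to be already superimposed. TM-score is omitted because it requires a structural alignment step.

```python
import math


def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one.

    Assumes both sequences have the same length and are aligned position by position.
    """
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes the two structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

A higher recovery rate means the designed sequence agrees with the native one at more positions, while a lower RMSD means the designed coordinates lie closer to the native structure.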