Inverse protein folding is challenging because of its inherent one-to-many mapping: many different amino acid sequences can fold into the same protein backbone. The task therefore involves not only identifying viable sequences but also representing the diversity of possible solutions. Existing discriminative models, such as transformer-based auto-regressive models, struggle to capture this range of plausible solutions. In contrast, diffusion probabilistic models, an emerging class of generative approaches, can generate a diverse set of sequence candidates for a given protein backbone. We propose a novel graph denoising diffusion model for inverse protein folding, in which a given protein backbone guides the diffusion process over the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physicochemical properties and local environment. Moreover, we use amino acid replacement matrices in the forward diffusion process. These matrices encode biologically meaningful prior knowledge about each amino acid, drawn from its spatial and sequential neighbors as well as from the residue itself, which reduces the sampling space of the generative process. Our model achieves state-of-the-art sequence recovery compared with a set of popular baseline methods and shows great potential for generating diverse protein sequences for a given protein backbone structure.
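To make the idea of a forward diffusion process driven by a replacement matrix more concrete, the following is a minimal sketch under stated assumptions, not the model described above. It uses a toy 4-letter alphabet, a made-up score matrix `SCORES` (not a real substitution matrix such as BLOSUM), and a hypothetical recipe (row-wise softmax mixed with the identity) for turning scores into a per-step transition matrix. The function names `transition_from_scores` and `forward_noise` are illustrative only.

```python
import numpy as np

def transition_from_scores(scores, beta):
    """Build a row-stochastic one-step transition matrix.

    Assumption: substitution scores become probabilities via a row-wise
    softmax, then are mixed with the identity so that a residue keeps
    its type with weight (1 - beta).
    """
    z = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    k = scores.shape[0]
    return (1.0 - beta) * np.eye(k) + beta * p

def forward_noise(x0, q_step, t, rng):
    """Sample x_t from Cat(onehot(x0) @ Q^t) for each residue independently."""
    q_bar = np.linalg.matrix_power(q_step, t)  # cumulative t-step transition
    probs = q_bar[x0]                          # one row per residue
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Toy, hypothetical symmetric score matrix over 4 residue types.
SCORES = np.array([
    [4.0, 1.0, -1.0, -2.0],
    [1.0, 4.0, 0.0, -1.0],
    [-1.0, 0.0, 4.0, 1.0],
    [-2.0, -1.0, 1.0, 4.0],
])

Q = transition_from_scores(SCORES, beta=0.1)
rng = np.random.default_rng(0)
x0 = np.array([0, 1, 2, 3, 0, 2])       # residue type indices of a short sequence
xt = forward_noise(x0, Q, t=20, rng=rng)  # noised residue types after 20 steps
```

Because substitutions with higher scores receive more probability mass, noised residues tend to move toward similar residue types rather than uniformly at random, which is one way to read the claim that the prior reduces the sampling space.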