The advent of deep learning has introduced efficient approaches for de novo protein sequence design, significantly improving success rates and reducing development costs compared to conventional computational or experimental methods. However, existing methods struggle to generate proteins of diverse lengths and shapes while maintaining key structural features. To address these challenges, we introduce CPDiffusion-SS, a latent graph diffusion model that generates protein sequences conditioned on coarse-grained secondary structure information. CPDiffusion-SS offers greater flexibility in producing a variety of novel amino acid sequences while preserving overall structural constraints, thus enhancing the reliability and diversity of the generated proteins. In experiments, CPDiffusion-SS produces significantly more diverse and novel sequences than popular baseline methods, surpassing them on open benchmarks across a range of quantitative metrics. Furthermore, we provide a series of case studies to highlight the biological significance of the sequences generated by the proposed method. The source code is publicly available at https://github.com/riacd/CPDiffusion-SS