AI-assisted protein design has become an important tool for advancing biotechnology, as deep generative models have proven reliable in this domain. However, most existing models are trained primarily on protein sequence or structural data and neglect the physicochemical properties of proteins. Moreover, they offer limited ability to control protein generation through intuitive conditions. To address these limitations, we propose CMADiff, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text-based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) that takes physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process guided by BioAligner, a contrastive learning-based module that aligns text descriptions with protein features, enabling text-driven control over protein sequence generation. Across a series of evaluations, including AlphaFold3-based validation, the experimental results indicate that CMADiff outperforms benchmark protein sequence generation methods and holds strong potential for future applications. The implementation and code are available at https://github.com/HPC-NEAU/PhysChemDiff.
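The contrastive alignment between text descriptions and protein features attributed to BioAligner can be illustrated with a generic symmetric InfoNCE-style loss. The sketch below is not the authors' implementation: the function names, the temperature value, the embedding dimensions, and the random toy embeddings are all assumptions made for illustration, and the actual BioAligner encoders and objective may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, protein_emb, temperature=0.07):
    # Row i of text_emb is assumed to describe row i of protein_emb.
    # The temperature is a hypothetical default, not a value from the paper.
    t = l2_normalize(text_emb)
    p = l2_normalize(protein_emb)
    logits = t @ p.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs lie on the diagonal
    loss_t2p = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> protein
    loss_p2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # protein -> text
    return 0.5 * (loss_t2p + loss_p2t)

# Toy embeddings standing in for encoder outputs (purely illustrative).
rng = np.random.default_rng(0)
protein = rng.normal(size=(8, 16))
aligned_text = protein + 0.01 * rng.normal(size=(8, 16))
shuffled_text = aligned_text[::-1]          # every pair deliberately mismatched

loss_aligned = symmetric_contrastive_loss(aligned_text, protein)
loss_shuffled = symmetric_contrastive_loss(shuffled_text, protein)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

Minimizing a loss of this kind pulls matched text and protein embeddings together and pushes mismatched pairs apart, so correctly paired inputs yield a lower value than shuffled ones. In a text-conditioned pipeline, the resulting text embedding could then serve as the conditioning signal for the diffusion process in the latent space.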