Protein design has become a critical method with significant potential for applications such as drug development and enzyme engineering. However, protein design methods that rely on large language models with only pretraining and fine-tuning struggle to capture relationships in multi-modal protein data. To address this, we propose ProtDAT, a de novo fine-grained framework capable of designing proteins from any descriptive protein text input. ProtDAT builds upon the inherent characteristics of protein data to treat sequences and text as a cohesive whole rather than as separate entities. It leverages a novel multi-modal cross-attention mechanism that integrates protein sequences and textual information seamlessly at a foundational level. Experimental results demonstrate that ProtDAT achieves state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity. On 20,000 text-sequence pairs from Swiss-Prot, it improves pLDDT by 6%, raises TM-score by 0.26, and reduces RMSD by 1.2 Å, highlighting its potential to advance protein design.
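To make the idea of cross-attention between modalities concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which each protein-sequence position attends over the tokens of a text description. It is not ProtDAT's actual layer: the function names `softmax` and `cross_attention`, the toy two-dimensional embeddings, the omission of learned query/key/value projections, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(seq_states, text_states):
    """Let every sequence position (query) attend over all text tokens.

    Assumption: learned query/key/value projections are omitted (treated
    as identity), so this shows only the attention pattern, not a trained
    layer from any particular model.
    """
    d = len(text_states[0])
    fused = []
    for q in seq_states:
        # Scaled dot-product score between this query and each text token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in text_states
        ]
        weights = softmax(scores)
        # Attention-weighted sum of the text vectors.
        context = [
            sum(w * v[j] for w, v in zip(weights, text_states))
            for j in range(d)
        ]
        # Residual connection: sequence state plus its text context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Hypothetical toy embeddings: 3 sequence positions, 2 text tokens, dim 2.
seq_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_states = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(seq_states, text_states)
```

In this toy setting, a sequence position whose embedding is more similar to a given text token draws more of its context from that token, which is the basic mechanism by which textual information can condition sequence representations.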