Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advances in targeting specific proteins, creating high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model that jointly designs the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data, and models are available at https://github.com/JocelynSong/PPDiff.
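To make two of the ingredients named above concrete, the following is a minimal sketch of a kNN neighbor graph over 3D residue coordinates and a causal attention mask. It is not taken from the PPDiff code: the function names `knn_edges` and `causal_mask`, the toy coordinates, and the choice of k are illustrative assumptions, and the actual SSINC layers may build and use these structures differently.

```python
import math


def knn_edges(coords, k):
    """Return directed edges (i, j) from each residue i to its k nearest residues j.

    Hypothetical helper for illustration; not part of the PPDiff repository.
    """
    edges = []
    for i, ci in enumerate(coords):
        # Euclidean distance from residue i to every other residue.
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i]
        dists.sort()  # ties are broken by the smaller residue index
        edges.extend((i, j) for _, j in dists[:k])
    return edges


def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy coordinates (assumed values, e.g. one representative atom per residue).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = knn_edges(coords, k=2)
mask = causal_mask(3)
```

In this sketch, the edge list restricts message passing to spatially close residues, while the mask restricts each sequence position to earlier positions; a real implementation would typically compute both with batched tensor operations.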