How can we design protein sequences that fold into desired structures both effectively and efficiently? AI methods for structure-based protein design have attracted increasing attention in recent years; however, few methods improve accuracy and efficiency simultaneously, because they lack expressive features and rely on autoregressive sequence decoders. To address these issues, we propose PiFold, which combines a novel residue featurizer with PiGNN layers to generate protein sequences in one shot with improved recovery. Experiments show that PiFold achieves 51.66\% recovery on CATH 4.2, with inference 70 times faster than autoregressive competitors. In addition, PiFold achieves 58.72\% and 60.42\% recovery on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the role of different types of protein features and model designs, inspiring further simplification and improvement. The PyTorch code is available at \href{https://github.com/A4Bio/PiFold}{GitHub}.
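For readers unfamiliar with the metric, recovery in this setting is commonly understood as the fraction of positions at which the designed residue matches the native one. The following is a minimal sketch under that assumption; the function names are hypothetical and are not taken from the PiFold repository, and aggregating per-protein scores with the median is likewise an assumption rather than something stated above.

```python
from statistics import median


def sequence_recovery(predicted: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native one.

    Assumes both sequences are aligned one-to-one (same length), which holds
    when a sequence is designed for a fixed backbone.
    """
    if len(predicted) != len(native):
        raise ValueError("predicted and native sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


def benchmark_recovery(pairs) -> float:
    """Aggregate per-protein recovery over (predicted, native) pairs.

    The median is used here as an assumption; a mean could be substituted.
    """
    return median(sequence_recovery(p, n) for p, n in pairs)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real benchmark data.
    toy_pairs = [("ACDE", "ACDF"), ("GGGG", "GGGG"), ("KLMN", "KLAA")]
    print(benchmark_recovery(toy_pairs))
```

Under this definition, a score such as 51.66\% means that roughly half of the designed residues coincide with the native sequence for a typical test protein.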