In antibody engineering, an essential task is to design a novel antibody whose paratope binds a specific antigen at the correct epitope. Knowing an antibody's structure and its paratope facilitates a mechanistic understanding of its function. Predicting antibody structure from sequence alone is therefore a highly valuable problem for de novo antibody design. AlphaFold2, a breakthrough in structural biology, predicts protein structure from protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its limited computational efficiency and unsatisfactory prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), restrict its use in industrial high-throughput drug design. To learn an informative representation of antibodies, we pretrained a transformer-based deep antibody language model (ALM) on curated sequences from the Observed Antibody Space database. We also developed a novel model, xTrimoABFold, which predicts antibody structure from antibody sequence using the pretrained ALM together with efficient Evoformer and structure modules. The model was trained end-to-end on the antibody structures in PDB by minimizing an ensemble loss that combines a domain-specific focal loss on the CDRs with the frame-aligned point loss. xTrimoABFold outperforms AlphaFold2 and other state-of-the-art methods based on protein language models, e.g., OmegaFold, HelixFold-Single, and IgFold, by a large margin (more than 30% improvement in RMSD) while running 151 times faster than AlphaFold2. To the best of our knowledge, xTrimoABFold achieves state-of-the-art antibody structure prediction. Its improvement in both accuracy and efficiency makes it a valuable tool for de novo antibody design and may support further advances in immunological theory.
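To make the reported metric concrete, the following is a minimal sketch of how an RMSD value and a relative improvement over a baseline could be computed. It assumes the predicted and reference coordinates are already superimposed and matched atom by atom; the function names `rmsd` and `relative_improvement` and the toy coordinates are illustrative assumptions, not part of xTrimoABFold or its evaluation code.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superimposed and the points
    are in the same atom order (e.g., C-alpha atoms of a CDR loop).
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared per-coordinate differences over all atoms.
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))


def relative_improvement(baseline_rmsd, model_rmsd):
    """Fractional RMSD reduction relative to a baseline (0.3 means 30%)."""
    return (baseline_rmsd - model_rmsd) / baseline_rmsd


# Toy coordinates, not data from the paper.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
toy_rmsd = rmsd(predicted, reference)
```

Under this definition, a model with an RMSD of 1.4 against a baseline of 2.0 would show a relative improvement of about 0.3, which is the sense in which an improvement of more than 30% can be read.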