Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

Inverse protein folding is a fundamental task in computational protein design: given a desired backbone structure, design protein sequences that fold into it. Although machine learning algorithms for this task have seen significant success, the prevailing approaches predominantly employ a discriminative formulation, frequently suffer from error accumulation, and often fail to capture the wide variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding that learns the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we use an expressive structure encoder to propose a discrete, informative prior derived from the structure, and establish a Markov bridge to connect this prior with native sequences. At inference time, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, which significantly enhances generation performance while keeping training parameter-efficient. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https://github.com/violet-sto/Bridge-IF.
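
To make the bridge idea concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a per-residue bridge in which each position either keeps its prior token or has already switched to its target token, with a linear schedule for the probability of keeping the prior. The names `linear_alpha_bar`, `sample_bridge_state`, `refine`, and `predict_final` are hypothetical, and `predict_final` stands in for the structure-conditioned PLM described above.

```python
import random


def linear_alpha_bar(t, num_steps):
    """Probability that a position still holds its prior token at step t.

    Assumption: a linear schedule, chosen here only for illustration.
    """
    return 1.0 - t / num_steps


def sample_bridge_state(prior_seq, native_seq, t, num_steps, rng):
    """Sample an intermediate sequence between the prior and the native one.

    Each position independently keeps the prior token with probability
    alpha_bar(t) and otherwise takes the native token.
    """
    if len(prior_seq) != len(native_seq):
        raise ValueError("prior and native sequences must have equal length")
    keep = linear_alpha_bar(t, num_steps)
    return "".join(
        p if rng.random() < keep else n for p, n in zip(prior_seq, native_seq)
    )


def refine(prior_seq, predict_final, num_steps, rng):
    """Progressively refine the prior sequence at inference time.

    predict_final(z, t) is a hypothetical model call that returns a guess
    of the final sequence given the current state z. At step t, each
    position adopts the predicted token with probability 1 / (num_steps - t),
    which matches the linear schedule above and equals 1 at the last step.
    """
    z = prior_seq
    for t in range(num_steps):
        predicted = predict_final(z, t)
        jump = 1.0 / (num_steps - t)
        z = "".join(
            x if rng.random() < jump else c for c, x in zip(z, predicted)
        )
    return z


# Usage with made-up sequences and an oracle predictor that always
# returns the native sequence (a real model would only approximate it).
rng = random.Random(0)
prior = "MKTAYIAKQR"
native = "MKSAYLAKQK"
mid = sample_bridge_state(prior, native, t=5, num_steps=10, rng=rng)
design = refine(prior, lambda z, t: native, num_steps=10, rng=rng)
```

In this sketch, training would pair a state such as `mid` with the native sequence as the prediction target, while inference starts from the prior alone. With the oracle predictor, `design` equals `native` because the final step adopts the prediction at every position.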
