Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequence data, and discrete diffusion-based protein language models (e.g., DPLMs) show promise for both understanding and generation. However, existing DPLMs typically rely on masked diffusion, which contradicts a simple biological intuition: proteins evolve through accumulated edits, not by emerging from masks. Consequently, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, which limits both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM-Evo decouples an upsampled-length latent alignment space from the variable-length observed sequence space, which makes indel-aware generation tractable. To better align substitutions with real evolution, we further introduce a contextualized evolutionary noising kernel that produces biologically informed, context-dependent mutation patterns. Across tasks, DPLM-Evo improves sequence understanding and achieves state-of-the-art mutation effect prediction performance on ProteinGym in the single-sequence setting. It also enables variable-length simulated evolution as well as post-editing and optimization of existing proteins via explicit edit trajectories.