Predicting the fitness impact of mutations is central to protein engineering but constrained by the scarcity of experimental assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data of, and fewer parameters than, recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available at https://github.com/aim-uofa/EvoIF.
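To make the log-odds scoring concrete, below is a minimal sketch of the standard masked-marginal recipe for pLM zero-shot fitness estimation: mask each mutated position and compare the model's log-probability of the mutant residue against the wild type. This illustrates the general technique the abstract refers to, not EvoIF's own architecture; the small ESM-2 checkpoint and the `log_odds_score` helper are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Small ESM-2 checkpoint chosen for the example (assumption, not EvoIF).
MODEL = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def log_odds_score(wt_seq: str, mutations: list[tuple[str, int, str]]) -> float:
    """Sum of log p(mutant aa) - log p(wild-type aa) over mutated sites.

    `mutations` holds (wt_aa, 0-based position, mut_aa) triples,
    e.g. ("A", 3, "G") for A4G in 1-based notation.
    """
    enc = tokenizer(wt_seq, return_tensors="pt")
    score = 0.0
    for wt_aa, pos, mut_aa in mutations:
        assert wt_seq[pos] == wt_aa, "wild-type residue mismatch"
        ids = enc["input_ids"].clone()
        ids[0, pos + 1] = tokenizer.mask_token_id  # +1 skips the CLS token
        logits = model(input_ids=ids, attention_mask=enc["attention_mask"]).logits
        log_probs = logits[0, pos + 1].log_softmax(dim=-1)
        score += (log_probs[tokenizer.convert_tokens_to_ids(mut_aa)]
                  - log_probs[tokenizer.convert_tokens_to_ids(wt_aa)]).item()
    return score  # higher = mutation predicted to be more fit

# Example: score the single mutation A4G on a toy sequence.
print(log_odds_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", [("A", 3, "G")]))
```

Under the IRL view above, this score is a reward estimate: residues that evolution (the "expert") would plausibly have demonstrated at a site receive higher log-odds than residues it avoids.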