Evolutionary Profiles for Protein Fitness Prediction

Predicting the fitness impact of mutations is central to protein engineering but is constrained by the small number of assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available at https://github.com/aim-uofa/EvoIF.
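
As a rough illustration of the log-odds scoring mentioned in the abstract, the sketch below scores a mutant as the sum over substituted positions of log p(mutant residue) minus log p(wild-type residue). It is not EvoIF: the per-position probabilities here come from a simple count-based within-family profile with pseudocounts, which stands in for the model's calibrated probabilities. The function names `family_profile` and `log_odds_score`, the `pseudocount` parameter, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def family_profile(aligned_homologs, pseudocount=1.0):
    """Per-column amino-acid probabilities from aligned homologs.

    Assumption: all sequences are pre-aligned to the same length; gap or
    unknown characters are ignored when counting.
    """
    length = len(aligned_homologs[0])
    profile = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in aligned_homologs if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, wild_type, mutations):
    """Sum of log p(mutant) - log p(wild type) over substitutions.

    Each mutation is written like 'K2R': wild-type residue, 1-indexed
    position, mutant residue.
    """
    score = 0.0
    for mut in mutations:
        wt_aa, pos, mut_aa = mut[0], int(mut[1:-1]) - 1, mut[-1]
        if wild_type[pos] != wt_aa:
            raise ValueError(f"{mut} does not match the wild-type sequence")
        score += math.log(profile[pos][mut_aa]) - math.log(profile[pos][wt_aa])
    return score


if __name__ == "__main__":
    # Toy, made-up alignment used only to exercise the functions.
    homologs = ["MKV", "MKV", "MRV", "MKI"]
    profile = family_profile(homologs)
    print(log_odds_score(profile, "MKV", ["K2R"]))
    print(log_odds_score(profile, "MKV", ["K2W"]))
```

In this toy alignment, a substitution seen among the homologs (K2R) is penalized less than one never observed (K2W), which is the qualitative behavior a profile-based fitness estimate is expected to show. Multi-site mutants are scored additively, a common simplifying assumption.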

