Protein language models (PLMs) learn probability distributions over natural protein sequences. Trained on hundreds of millions of such sequences, they develop emergent capabilities for protein understanding and design. Recent work has shown that scaling these models improves structure prediction but does not seem to improve mutation understanding or representation quality for protein function prediction. We introduce PoET-2, a multimodal, retrieval-augmented protein foundation model that combines in-context learning of family-specific evolutionary constraints with optional structure conditioning to learn generative distributions over protein sequences. PoET-2 uses a hierarchical transformer encoder that is equivariant to sequence context ordering and a dual-decoder architecture with both causal and masked language modeling objectives, allowing it to operate in both fully generative and bidirectional representation learning modes. PoET-2 achieves state-of-the-art performance on zero-shot variant effect prediction, excelling at scoring variants with multiple mutations and challenging indel mutations. In supervised settings, PoET-2 embeddings outperform previous methods for learning sequence-function relationships, especially on small datasets. This work highlights the benefits of combining retrieval augmentation with multimodal, family-centric modeling for advancing protein foundation models.
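As a rough illustration of how zero-shot variant effect prediction is commonly framed with a generative sequence model, the sketch below scores a variant by the difference in log-likelihood between the variant and the wild type, given a set of homologous context sequences. This is not PoET-2's implementation or API: `fit_toy_model`, `log_likelihood`, and `variant_score` are hypothetical names, and the "model" is a deliberately crude per-residue frequency table standing in for a real PLM. Because whole sequences are scored rather than aligned positions, substitutions and indels go through the same code path.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_toy_model(context_seqs, pseudocount=1.0):
    """Hypothetical stand-in for a PLM conditioned on a protein family.

    Estimates one log-probability per amino acid from the pooled context
    sequences. Pooling counts makes the result independent of the order
    in which the context sequences are supplied.
    """
    counts = Counter(aa for seq in context_seqs for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: math.log((counts[aa] + pseudocount) / total) for aa in AMINO_ACIDS}


def log_likelihood(seq, log_probs):
    """Log-likelihood of a full sequence under the toy model.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return sum(log_probs[aa] for aa in seq)


def variant_score(wild_type, variant, log_probs):
    """Log-likelihood ratio of variant vs. wild type (higher = more plausible).

    No alignment is needed, so the variant may differ in length
    from the wild type (insertions or deletions).
    """
    return log_likelihood(variant, log_probs) - log_likelihood(wild_type, log_probs)


if __name__ == "__main__":
    context = ["ACDA", "ACDA", "ACEA"]  # toy "family" of homologs
    model = fit_toy_model(context)
    wild_type = "ACDA"
    for variant in ["ACEA", "ACA", "ACDDA"]:  # substitution, deletion, insertion
        print(variant, round(variant_score(wild_type, variant, model), 3))
```

A position-independent frequency table trivially favors shorter sequences, so the numbers it prints are meaningless biologically; the point is only the shape of the computation, in which a real model would replace `fit_toy_model` and `log_likelihood`.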