Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in function and structure, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSAs) to supply evolutionary knowledge. Despite their success, heavy computational overhead and the handling of de novo and orphan proteins remain major challenges in protein representation learning. In this work, we show that MSA-augmented models are inherently retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links each query protein sequence to a set of sequences in a database with similar structures or properties and combines these sequences for downstream prediction. We show that protein language models benefit from retrieval enhancement on both structure prediction and property prediction tasks, with an average improvement of 5% over MSA Transformer while being 373 times faster. In addition, we show that our model transfers better to new protein domains and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
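The retrieve-then-combine idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general shape of retrieval augmentation without alignment. Every name here is hypothetical: `kmer_profile` is a toy stand-in for a learned sequence encoder, `retrieve` is a brute-force nearest-neighbour search, and `predictor` is any user-supplied function that scores a query together with one retrieved sequence. The softmax weighting of per-neighbour predictions is likewise an assumption made for illustration.

```python
import math
from collections import Counter


def kmer_profile(seq, k=2):
    """Toy encoder (assumption): counts of overlapping k-mers in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, database, top_k=2):
    """Return the top_k (similarity, sequence) pairs, with no alignment step."""
    q = kmer_profile(query)
    scored = [(cosine(q, kmer_profile(s)), s) for s in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def predict_with_retrieval(query, database, predictor, top_k=2):
    """Combine per-neighbour predictions, weighted by a softmax over similarity."""
    hits = retrieve(query, database, top_k)
    weights = [math.exp(score) for score, _ in hits]
    total = sum(weights)
    return sum(
        (w / total) * predictor(query, seq) for w, (_, seq) in zip(weights, hits)
    )


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    database = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGSGGGGS"]
    query = "MKTAYIAKQK"

    # Hypothetical predictor: mean fraction of alanine in the sequence pair.
    def toy_predictor(q, s):
        return (q.count("A") + s.count("A")) / (len(q) + len(s))

    print(retrieve(query, database))
    print(predict_with_retrieval(query, database, toy_predictor))
```

In a realistic setting the k-mer encoder would be replaced by embeddings from a pretrained protein language model and the brute-force search by an approximate nearest-neighbour index; the sketch keeps both trivial so that the control flow stays visible.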