Retrieved Sequence Augmentation for Protein Representation Learning

Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in function and structure, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSAs) to supply evolutionary knowledge. Despite their success, heavy computational overhead and the handling of de novo and orphan proteins remain major challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to the family of retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links a query protein sequence to a set of sequences in the database with similar structures or properties and combines these sequences for downstream prediction. We show that protein language models benefit from retrieval enhancement on both structure prediction and property prediction tasks, with an average 5% improvement over MSA Transformer while being 373 times faster. In addition, we show that our model transfers better to new protein domains and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
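The retrieve-then-combine idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The k-mer count embedding, the cosine-similarity ranking, the softmax weighting, and every function name below (`kmer_embed`, `retrieve`, `rsa_predict`, `toy_predictor`) are assumptions made only for illustration; the abstract does not specify how sequences are retrieved or combined.

```python
import math
from collections import Counter


def kmer_embed(seq, k=2):
    """Toy embedding: counts of overlapping k-mers (a stand-in for a learned encoder)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database, top_k=2):
    """Return the top_k (sequence, similarity) pairs, most similar first."""
    q = kmer_embed(query)
    scored = [(seq, cosine(q, kmer_embed(seq))) for seq in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rsa_predict(query, database, pair_predictor, top_k=2):
    """Average per-(query, retrieved) predictions, weighted by retrieval similarity."""
    hits = retrieve(query, database, top_k)
    weights = softmax([sim for _, sim in hits])
    preds = [pair_predictor(query, seq) for seq, _ in hits]
    return sum(w * p for w, p in zip(weights, preds))


def toy_predictor(query, retrieved):
    """Hypothetical downstream model: fraction of positions where residues match."""
    matches = sum(1 for x, y in zip(query, retrieved) if x == y)
    return matches / max(len(query), 1)


# Made-up sequences, used only to exercise the functions above.
database = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGSGGGGS"]
hits = retrieve("MKTAYIAKQR", database, top_k=2)
score = rsa_predict("MKTAYIAKQR", database, toy_predictor, top_k=2)
```

In this sketch no alignment step is performed: retrieved sequences are paired with the query as they are, and only the aggregation weights depend on the retrieval similarity. A real system would replace `kmer_embed` with a trained encoder and `toy_predictor` with a protein language model.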

