Large Language Models (LLMs) have achieved remarkable performance on multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, unlike LLMs in NLP, current PLMs cannot handle protein understanding tasks and protein generation tasks simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations of current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework that transforms any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples annotated with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results on the unconditional protein sequence generation task. On the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. On the protein understanding task, ProLLaMA achieves a 62\% exact match rate in superfamily prediction. Code, model weights, and datasets are available at \url{https://github.com/PKU-YuanGroup/ProLLaMA} and \url{https://huggingface.co/GreatCaptainNemo}.