Large Language Models (LLMs) have achieved remarkable performance on multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, unlike LLMs in NLP, current PLMs cannot handle protein understanding tasks and protein generation tasks simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations of current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework that transforms any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples annotated with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results on the unconditional protein sequence generation task. On the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. On the protein understanding task, ProLLaMA achieves a 62\% exact match rate in superfamily prediction. Code, model weights, and datasets are available at \url{https://github.com/PKU-YuanGroup/ProLLaMA} and \url{https://huggingface.co/GreatCaptainNemo}.