Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurately predicting their properties and functions is crucial in biological applications. The recent development of protein language models (pLMs) with supervised fine-tuning offers a promising solution to this problem. However, a fine-tuned model is tailored to a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a novel structure-aware module into pLMs to endow them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. Within this framework, we propose a novel two-stage instruction tuning pipeline that first establishes a basic understanding of proteins through caption-based instructions and then refines this understanding using a mixture of experts (MoEs) to learn more complex properties and functional information, with the same number of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate a general-purpose protein understanding model. Extensive experimental results on open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.