Language Models (LMs) excel at understanding textual descriptions of proteins, as is evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, owing to a lack of pretraining on such data. Conversely, Protein Language Models (PLMs) can understand protein data and convert it into high-quality representations, but struggle to process text. To address these limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between the PLM and the LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies, which focus on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
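To make the role of the cross-modal projector concrete, the following is a minimal sketch of the general idea, not the ProtT3 implementation. It assumes the projector can be approximated by a single cross-attention step in which a fixed set of query vectors attends over per-residue PLM embeddings, followed by a linear map into the LM's input dimension. An actual Q-Former is a multi-layer transformer with trained weights; here the class name `ToyProjector`, the dimensions, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for weights or embeddings (nothing here is trained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyProjector:
    """Hypothetical single-step stand-in for a Q-Former-style projector."""

    def __init__(self, num_queries, plm_dim, lm_dim, seed=0):
        rng = random.Random(seed)
        # Query vectors; these would be learned in a real model.
        self.queries = random_matrix(num_queries, plm_dim, rng)
        # Linear map from the PLM representation space to the LM input space.
        self.to_lm = random_matrix(plm_dim, lm_dim, rng)
        self.scale = 1.0 / math.sqrt(plm_dim)

    def __call__(self, residue_embeddings):
        # residue_embeddings: (sequence_length x plm_dim)
        keys_t = [list(col) for col in zip(*residue_embeddings)]
        scores = matmul(self.queries, keys_t)  # (num_queries x sequence_length)
        weights = [softmax([s * self.scale for s in row]) for row in scores]
        pooled = matmul(weights, residue_embeddings)  # (num_queries x plm_dim)
        return matmul(pooled, self.to_lm)  # (num_queries x lm_dim)


# Stand-in for PLM output on a 50-residue protein with 16-dim embeddings.
protein = random_matrix(50, 16, random.Random(1))
projector = ToyProjector(num_queries=4, plm_dim=16, lm_dim=32)
soft_prompt = projector(protein)
print(len(soft_prompt), len(soft_prompt[0]))  # prints "4 32"
```

The point of the sketch is the shape contract: whatever the protein length, the projector emits a fixed number of vectors in the LM's input dimension, which can then be placed in front of the text token embeddings as a soft prompt.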