Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As the biomedical literature grows rapidly, automated and accurate extraction of PPIs is increasingly needed to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated multiple GPT and BERT models on PPI identification using three manually curated gold-standard corpora: Learning Language in Logic (LLL), with 164 PPIs in 77 sentences; Human Protein Reference Database, with 163 PPIs in 145 sentences; and Interaction Extraction Performance Assessment, with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT the highest precision (85.25%). Despite not being explicitly trained on biomedical text, GPT-4 performed comparably to the top BERT models, achieving a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs in text, offering promising avenues for biomedical literature mining. Further research could explore how these models might be fine-tuned for more specialized tasks within the biomedical domain.
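
For readers less familiar with the reported metrics, the following minimal Python sketch shows how precision, recall, and F1-score are commonly computed when predicted PPIs are compared against a gold-standard annotation. It is an illustration, not the evaluation code used in this study. The function names, the protein identifiers (P1, P2, ...), and the choice to treat an interaction as an unordered protein pair within a sentence are all assumptions made for the example.

```python
from typing import FrozenSet, Set, Tuple

# Hypothetical representation: (sentence id, unordered pair of protein names).
Interaction = Tuple[int, FrozenSet[str]]


def make_interaction(sentence_id: int, protein_a: str, protein_b: str) -> Interaction:
    """Build an order-independent key for one interaction (an assumption of this sketch)."""
    return (sentence_id, frozenset((protein_a, protein_b)))


def precision_recall_f1(gold: Set[Interaction],
                        predicted: Set[Interaction]) -> Tuple[float, float, float]:
    """Compare predicted interactions with gold-standard ones."""
    true_pos = len(gold & predicted)    # predicted and annotated
    false_pos = len(predicted - gold)   # predicted but not annotated
    false_neg = len(gold - predicted)   # annotated but missed

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {
    make_interaction(1, "P1", "P2"),
    make_interaction(1, "P2", "P3"),
    make_interaction(2, "P4", "P5"),
    make_interaction(3, "P6", "P7"),
}
predicted = {
    make_interaction(1, "P2", "P1"),   # same pair as gold, different order
    make_interaction(2, "P4", "P5"),
    make_interaction(3, "P6", "P8"),   # not in gold: a false positive
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of the three predictions are correct and two of the four gold interactions are recovered. When scores are averaged over several runs or data splits, the averaged F1-score need not equal the harmonic mean of the averaged precision and recall.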
