Large language models (LLMs) are a class of deep-learning-based artificial intelligence models that perform strongly across many tasks, especially in natural language processing (NLP). They typically consist of artificial neural networks with very large numbers of parameters, trained on large amounts of unlabeled data using self-supervised or semi-supervised learning. Their potential for solving bioinformatics problems, however, may even exceed their proficiency in modeling human language. In this review, we summarize the prominent large language models used in natural language processing, such as BERT and GPT, and then explore applications of large language models at different omics levels in bioinformatics, mainly genomics, transcriptomics, and proteomics, as well as in drug discovery and single-cell analysis. Finally, we discuss the potential and prospects of large language models for solving bioinformatics problems.