Large language models (LLMs) are a class of deep-learning-based artificial intelligence models that achieve strong performance across a wide range of tasks, particularly in natural language processing (NLP). LLMs typically consist of artificial neural networks with very large numbers of parameters, trained on massive amounts of unlabeled data using self-supervised or semi-supervised learning. Their potential for solving bioinformatics problems, however, may even exceed their proficiency in modeling human language. In this review, we provide a comprehensive overview of the essential components of LLMs in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. We then introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing on our experience, we offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.