Large language models (LLMs) are a class of deep-learning-based artificial intelligence models that achieve strong performance across a wide range of tasks, particularly in natural language processing (NLP). LLMs typically consist of artificial neural networks with very large numbers of parameters, trained on massive amounts of unlabeled data using self-supervised or semi-supervised learning. Their potential for solving bioinformatics problems, however, may even exceed their proficiency in modeling human language. In this review, we provide a comprehensive overview of the essential components of LLMs in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. We then introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing on our experience, we offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.