Scientific Large Language Models: A Survey on Biological & Chemical Domains

Qiang Zhang, Keyang Ding, Tianwen Lyv, Xinda Wang, Qingyu Yin, Yiwen Zhang, Jing Yu, Yuhao Wang, Xiaotong Li, Zhuoyi Xiang, Kehua Feng, Xiang Zhuang, Zeyuan Wang, Ming Qin, Mengyao Zhang, Jinlu Zhang, Jiyu Cui, Tao Huang, Pengju Yan, Renjun Xu, Hongyang Chen, Xiaolin Li, Xiaohui Fan, Huabin Xing, Huajun Chen

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

