Scientific Large Language Models: A Survey on Biological & Chemical Domains

Qiang Zhang, Keyang Ding, Tianwen Lyv, Xinda Wang, Qingyu Yin, Yiwen Zhang, Jing Yu, Yuhao Wang, Xiaotong Li, Zhuoyi Xiang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Mengyao Zhang, Jinlu Zhang, Jiyu Cui, Renjun Xu, Hongyang Chen, Xiaohui Fan, Huabin Xing, Huajun Chen

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

