Large Language Models (LLMs) have emerged as a transformative force in natural language understanding, representing a significant stride toward artificial general intelligence. Their application extends beyond conventional natural language to the specialized linguistic systems developed within various scientific disciplines. This growing interest has given rise to scientific LLMs, a novel subclass specifically engineered to facilitate scientific discovery. As a burgeoning area within the AI for Science community, scientific LLMs warrant comprehensive exploration; however, a systematic and up-to-date survey of them is currently lacking. In this paper, we methodically delineate the concept of "scientific language" and provide a thorough review of the latest advances in scientific LLMs. Given the breadth of scientific disciplines, our analysis concentrates on the biological and chemical domains. It covers LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions in step with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aims to serve as a useful resource for researchers navigating the intricate landscape of scientific LLMs.