Large Language Models (LLMs) have emerged as a transformative force in natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered to facilitate scientific discovery. As a burgeoning area in the AI for Science community, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we methodically delineate the concept of "scientific language" while providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions that emerge as LLMs advance. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.