In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and in a range of interdisciplinary areas. Applying LLMs to chemistry, however, is a complex task that requires specialized domain knowledge. This paper examines in depth the methodologies used to integrate LLMs into chemistry, covering both the complexities and the innovations at this interdisciplinary juncture. Our analysis begins with how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs. Next, we review pretraining objectives and how they are adapted for chemical LLMs. We then explore the diverse applications of LLMs in chemistry, including novel paradigms for applying them to chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advances in continual learning, and improvements in model interpretability.