Large Language Models (LLMs) with strong natural language processing abilities have emerged and have been applied in areas such as science, finance and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs on a wide range of tasks across the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs (understanding, reasoning and explaining) and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs in the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings, with carefully selected demonstration examples and specially crafted prompts. We find that GPT-4 outperforms the other models and that the LLMs show different levels of competitiveness across the eight chemistry tasks. In addition to the key findings from the comprehensive benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.