Large Language Models (LLMs) with strong natural language processing abilities have emerged and been applied in areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs across a wide range of tasks in the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs (understanding, reasoning, and explaining) and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs in the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings, with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed the other models and that LLMs exhibit different levels of competitiveness across the eight chemistry tasks. In addition to the key findings from the comprehensive benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.