To enhance large language models (LLMs) for chemistry problem solving, several tool-augmented LLM-based agents have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent built on ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that for specialized chemistry tasks, such as synthesis prediction, agents should be augmented with specialized tools; however, for general chemistry questions like those in exams, an agent's ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.