We introduce ChemPro, a progressive benchmark of 4,100 natural-language question-answer pairs in chemistry, organized into four coherent sections of increasing difficulty and designed to assess the proficiency of Large Language Models (LLMs) across a broad spectrum of general chemistry topics. The benchmark comprises multiple-choice and numerical questions in a balanced ratio, spanning fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving requiring nuanced articulation, and straightforward questions, and covers Biochemistry, Inorganic Chemistry, Organic Chemistry, and Physical Chemistry. ChemPro is carefully designed to mirror a student's academic evaluation from basic through high-school chemistry: the gradual increase in question difficulty rigorously tests whether LLMs can progress from solving basic problems to tackling more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines across different question types and levels of complexity. These findings highlight critical limitations of LLMs in general scientific reasoning and understanding, point toward understudied dimensions of difficulty, and underscore the need for more robust methodologies to improve LLMs.