Most existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce SciBench, an expansive benchmark suite for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from the mathematics, chemistry, and physics domains. Based on this dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with the best overall score reaching merely 43.22%. Furthermore, through a detailed user study, we attribute the errors made by LLMs to deficits in ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others, and that some strategies that demonstrate improvements in certain problem-solving skills can result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.