Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs across the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale, comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics comprises 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline tailored to assessing the answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy is only 49.8% (achieved by OpenAI-o1-mini), underscoring the need for models with physics reasoning skills that go beyond mathematical ability. We hope UGPhysics, along with MARJ, will drive future advances in AI for physics reasoning. Code and data are available at https://github.com/YangLabHKUST/UGPhysics .