Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs across the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale, comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics comprises 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline tailored to assessing the answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy is only 49.8% (achieved by OpenAI-o1-mini), underscoring the need for models with physics reasoning skills that go beyond mathematical ability. We hope UGPhysics, along with MARJ, will drive future advances in AI for physics reasoning. Code and data are available at https://github.com/YangLabHKUST/UGPhysics .