Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. To address this gap, this paper introduces PHYSICS, a dataset of 16,568 high-quality physics problems spanning multiple subjects and difficulty levels. Specifically, PHYSICS is curated from exercises in over 100 textbooks through a carefully designed quality-control pipeline. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics, and spans a wide range of difficulty levels, from high-school to graduate-level physics courses. To make the data useful for both improving and evaluating models' physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for evaluation, we find that existing evaluation frameworks exhibit biases in the physics domain with respect to units, simplification, and precision. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations of state-of-the-art open-source and proprietary models highlight their limitations in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics. The code and data are available at: https://github.com/Zhengsh123/PHYSICS.
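As a minimal sketch of what a Rule+Model check might look like, the snippet below tries a cheap rule stage first (symbolic simplification, then a tolerance-based numeric comparison to absorb precision differences) and falls back to an LLM judge only when the rules abstain. The function names (`rule_check`, `model_judge`), the tolerance value, and the abstention logic are illustrative assumptions, not the paper's actual pipeline.

```python
import sympy as sp

def rule_check(pred: str, gold: str, rel_tol: float = 1e-4):
    """Rule stage: symbolic simplification, then a numeric comparison with
    a relative tolerance. Returns True/False when the rules can decide,
    or None to abstain (e.g. units, free text, or parse failures)."""
    try:
        p, g = sp.sympify(pred), sp.sympify(gold)
        diff = sp.simplify(p - g)
        if diff == 0:
            return True
        if diff.is_number and g.is_number:
            denom = max(abs(float(g)), 1e-12)
            return abs(float(diff)) / denom < rel_tol
    except (sp.SympifyError, TypeError, ValueError):
        pass
    return None  # undecidable by rules; defer to the model judge

def evaluate(pred: str, gold: str, model_judge) -> bool:
    """Rule+Model evaluation: cheap rules first; the LLM judge (a
    hypothetical callable) is invoked only when the rules abstain."""
    verdict = rule_check(pred, gold)
    return verdict if verdict is not None else model_judge(pred, gold)
```

Running the rule stage first keeps the frequent, easy cases (exact or numerically equivalent answers) off the judge model, so the expensive model call is reserved for the unit-laden or free-form answers where rule-based matching is biased.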