Large language models (LLMs) have been extensively studied on tasks such as math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios via code across five physics domains and 52 physical concepts. We build an automatic data-collection pipeline with human verification to ensure quality. The final dataset contains 7,659 physical scenarios, of which 334 human-verified examples form the test set. We evaluate 10 contemporary LLMs and find that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train text-only models. Experiments show that training on our data improves physical simulation via code while also substantially enhancing general code generation performance.