DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator: the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles, and an LLM-judged explanation score that follows an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in designing informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality, and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
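To make the trajectory-MSE axis concrete, the following is a minimal sketch and not the benchmark's actual simulator or scoring code. It assumes a Yukawa-type screened gravity law as one possible reading of "screened gravity", a velocity-Verlet integrator, and a fresh set of initial conditions standing in for held-out particles. All names (`screened_accel`, `simulate`, `trajectory_mse`) and parameters (`lam`, `eps`, `dt`, `steps`) are hypothetical.

```python
import numpy as np
from functools import partial


def screened_accel(pos, masses, G=1.0, lam=2.0, eps=1e-3):
    """Accelerations under an assumed Yukawa-screened attraction.

    Derived from the pair potential -G*m*exp(-r/lam)/r; lam=np.inf
    recovers ordinary (softened) Newtonian gravity.
    """
    diff = pos[None, :, :] - pos[:, None, :]        # diff[i, j] = r_j - r_i
    dist = np.sqrt((diff ** 2).sum(-1) + eps ** 2)  # softened pair distance
    mag = (G * masses[None, :]
           * (1.0 / dist ** 2 + 1.0 / (dist * lam))
           * np.exp(-dist / lam))
    np.fill_diagonal(mag, 0.0)                      # no self-interaction
    return (mag[:, :, None] * diff / dist[:, :, None]).sum(axis=1)


def simulate(pos0, vel0, masses, accel_fn, dt=0.01, steps=200):
    """Velocity-Verlet integration; returns positions, shape (steps+1, N, D)."""
    pos, vel = pos0.copy(), vel0.copy()
    acc = accel_fn(pos, masses)
    traj = [pos.copy()]
    for _ in range(steps):
        vel_half = vel + 0.5 * dt * acc
        pos = pos + dt * vel_half
        acc = accel_fn(pos, masses)
        vel = vel_half + 0.5 * dt * acc
        traj.append(pos.copy())
    return np.stack(traj)


def trajectory_mse(pred, true):
    """Mean squared position error over all steps, particles, and axes."""
    return float(np.mean((pred - true) ** 2))


# Hypothetical held-out setup: 4 particles in 2D with random initial state.
rng = np.random.default_rng(0)
pos0 = rng.uniform(-1.0, 1.0, size=(4, 2))
vel0 = rng.uniform(-0.1, 0.1, size=(4, 2))
masses = np.ones(4)

true_law = partial(screened_accel, lam=2.0)         # the "world"
newton_guess = partial(screened_accel, lam=np.inf)  # a wrong submission
correct_guess = partial(screened_accel, lam=2.0)    # a correct submission

true_traj = simulate(pos0, vel0, masses, true_law)
mse_newton = trajectory_mse(simulate(pos0, vel0, masses, newton_guess), true_traj)
mse_correct = trajectory_mse(simulate(pos0, vel0, masses, correct_guess), true_traj)
```

In this toy setup the correct law reproduces the reference trajectories exactly, while the Newtonian guess accumulates a nonzero error. A low MSE alone still says nothing about whether the accompanying explanation identifies the screening mechanism, which is why the abstract pairs it with a rubric-based explanation score.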
