Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes and physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps, including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics, and geometry, bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 vision-language models (VLMs), comprising 54 open-weight models and 2 closed-source frontier models, as well as 10 vision foundation models (VFMs), and reveal limitations in their low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.