Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this ability rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality, which is tangential to whether generated video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark spanning both real and simulated environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, which test whether models can reason about and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, which evaluate whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. We evaluate a broad range of recent MLLMs and video generation models on PhysicsMind and find that they rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training remain insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
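For reference, the three principles can be stated compactly; the formulation below is a standard textbook sketch, not the benchmark's exact scoring rule:

$$
\mathbf{r}_{\mathrm{cm}} = \frac{\sum_i m_i \mathbf{r}_i}{\sum_i m_i},
\qquad
\sum_i \boldsymbol{\tau}_i = \sum_i \mathbf{r}_i \times \mathbf{F}_i = \mathbf{0},
\qquad
\mathbf{F}_{\mathrm{net}} = \mathbf{0} \;\Rightarrow\; \dot{\mathbf{v}} = \mathbf{0},
$$

i.e., the center of mass of a system of particles, torque balance for lever equilibrium, and Newton's First Law (constant velocity in the absence of a net force).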