Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, but they lack a grounded understanding of physical dynamics. This limitation arises because VLMs are trained on static, internet-scale vision-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method achieves state-of-the-art performance on five challenging real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding into VLM reasoning at test time via efficient simulation offers a promising path towards generalizable embodied intelligence. The project webpage is available at https://simpact-bot.github.io
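The propose, simulate, refine cycle described in the abstract can be illustrated with a minimal sketch. This is not SIMPACT's implementation: the 1D sliding-block simulator, the rule-based `propose_action` stub standing in for the VLM, and all names and parameters below are hypothetical, chosen only to show the shape of a simulation-in-the-loop planning loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    action: float          # initial push speed (m/s)
    final_position: float  # where the block came to rest (m)
    error: float           # absolute distance from the goal (m)


def simulate_push(push_speed: float, friction: float = 0.5, dt: float = 0.01) -> float:
    """Toy stand-in for a physics simulator (assumption, not the paper's engine).

    A block starts with velocity `push_speed` and slides until Coulomb
    friction brings it to rest. Returns the final position.
    """
    g = 9.81
    position, velocity = 0.0, push_speed
    while velocity > 0.0:
        position += velocity * dt
        velocity -= friction * g * dt
    return position


def propose_action(history: List[Rollout], goal: float) -> float:
    """Hypothetical stand-in for the VLM proposer.

    A real system would query a VLM with the observation and past rollouts.
    Here we rescale the last action, using the fact that sliding distance
    grows roughly with the square of the initial speed.
    """
    if not history:
        return 1.0  # arbitrary initial guess
    last = history[-1]
    return last.action * (goal / last.final_position) ** 0.5


def plan(goal: float, max_iters: int = 10, tol: float = 0.01) -> Rollout:
    """Propose an action, simulate it, and refine until the rollout is close enough."""
    history: List[Rollout] = []
    for _ in range(max_iters):
        action = propose_action(history, goal)
        final_position = simulate_push(action)
        history.append(Rollout(action, final_position, abs(final_position - goal)))
        if history[-1].error < tol:
            break
    # Return the best rollout seen; its action is what would be executed.
    return min(history, key=lambda r: r.error)


if __name__ == "__main__":
    best = plan(goal=0.5)
    print(f"push speed {best.action:.3f} m/s, rest position {best.final_position:.3f} m")
```

The point of the sketch is the control flow: each proposal is checked against a simulated outcome before anything is executed, and the proposer sees the history of rollouts when producing its next candidate. No model weights are updated, which mirrors the test-time, training-free setting the abstract describes.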