Humans can look at a static scene and instantly predict what happens next: will moving this object cause a collision? We call this ability Causal Spatial Reasoning. Current multimodal large language models (MLLMs) cannot do this; they remain largely restricted to static spatial perception and struggle to answer "what-if" questions about a 3D scene. We introduce CausalSpatial, a diagnostic benchmark that evaluates whether models can anticipate the consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. The results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from the visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than in linguistic priors. The dataset and code are publicly available at https://github.com/CausalSpatial/CausalSpatial