The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address this limitation, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, which consists of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and a family of models (the Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.