Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
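To make the 3D isovist concrete, the following is a minimal sketch of a spherical visibility-depth map computed by ray casting in a toy scene. It is an illustration under stated assumptions, not the pipeline released with the paper: the function names (`sphere_directions`, `ray_box_distance`, `isovist`), the equirectangular elevation-by-azimuth grid, the axis-aligned box standing in for a building, the flat ground plane at z = 0, and the `r_max` clamp for open sky are all hypothetical choices. The last lines show one plausible reading of the depth-residual target, taken here in linear depth as the difference between consecutive isovists after a forward step.

```python
import numpy as np

def sphere_directions(n_elev=33, n_azim=64):
    """Unit view vectors on an (elevation x azimuth) grid covering the sphere."""
    elev = np.linspace(-np.pi / 2, np.pi / 2, n_elev)           # straight down .. straight up
    azim = np.linspace(-np.pi, np.pi, n_azim, endpoint=False)   # full turn around z
    el, az = np.meshgrid(elev, azim, indexing="ij")
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1)                      # shape (n_elev, n_azim, 3)

def ray_box_distance(origin, dirs, box_min, box_max):
    """Slab test: distance along each ray to an axis-aligned box, inf on a miss."""
    # Replace near-zero components so the division below stays finite.
    safe = np.where(np.abs(dirs) < 1e-12, 1e-12, dirs)
    t1 = (box_min - origin) / safe
    t2 = (box_max - origin) / safe
    t_near = np.minimum(t1, t2).max(axis=-1)   # latest entry over the three slabs
    t_far = np.maximum(t1, t2).min(axis=-1)    # earliest exit over the three slabs
    t_entry = np.maximum(t_near, 0.0)
    hit = t_far >= t_entry
    return np.where(hit, t_entry, np.inf)

def isovist(origin, boxes, n_elev=33, n_azim=64, r_max=200.0):
    """Distance to the nearest surface in every direction, clamped to r_max (assumed sky value)."""
    origin = np.asarray(origin, dtype=float)
    dirs = sphere_directions(n_elev, n_azim)
    depth = np.full(dirs.shape[:2], r_max)

    # Ground plane z = 0 (assumption): only downward rays can reach it.
    dz = dirs[..., 2]
    down = dz < -1e-12
    t_ground = np.full(depth.shape, np.inf)
    t_ground[down] = -origin[2] / dz[down]
    depth = np.minimum(depth, t_ground)

    # Buildings as axis-aligned boxes (assumption).
    for box_min, box_max in boxes:
        t_box = ray_box_distance(origin, dirs,
                                 np.asarray(box_min, dtype=float),
                                 np.asarray(box_max, dtype=float))
        depth = np.minimum(depth, t_box)
    return depth

# Toy scene: one 30 m tall building whose near face is 10 m ahead along +x.
boxes = [((10.0, -5.0, 0.0), (20.0, 5.0, 30.0))]

d_prev = isovist((0.0, 0.0, 1.5), boxes)   # agent eye height 1.5 m
d_next = isovist((2.0, 0.0, 1.5), boxes)   # after a 2 m step along +x

# Hypothetical residual target: the change in depth per direction.
residual = d_next - d_prev
```

Each isovist here is a 33 x 64 array of distances: row 0 looks straight down, row 16 looks along the horizon, and column 32 faces +x. A residual target of this kind leaves most directions near zero for a small step and concentrates the change where surfaces approach or recede, which is consistent with the abstract's motivation that a residual formulation lets the decoder inherit sharp building edges from the previous frame.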