SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, setting a new state of the art in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
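The idea of anchoring per-region tokens in a global frame and accumulating them into a queryable graph can be illustrated with a minimal sketch. This is not the SNOW implementation: the names `StepToken`, `Pose`, `SceneGraph4D`, `to_world`, and `displacement` are hypothetical, the token fields are assumed, the pose is simplified to a translation plus yaw instead of a full SE(3) transform, and object identities are assumed to be given (the abstract does not say how tokens are associated across time).

```python
import math
from dataclasses import dataclass, field


@dataclass
class StepToken:
    """Hypothetical stand-in for one STEP token; the fields are assumptions."""
    object_id: int      # assumed to come from some upstream association step
    label: str          # semantic attribute, e.g. a VLM-derived class name
    centroid: tuple     # (x, y, z) of the region in the sensor frame
    timestamp: float    # capture time in seconds


@dataclass
class Pose:
    """Sensor pose in the world frame, simplified to translation plus yaw."""
    x: float
    y: float
    z: float
    yaw: float          # rotation about the vertical axis, in radians


def to_world(pose, point):
    """Transform a sensor-frame point into the world frame."""
    px, py, pz = point
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return (pose.x + c * px - s * py,
            pose.y + s * px + c * py,
            pose.z + pz)


@dataclass
class SceneGraph4D:
    """Toy 4D scene graph: one time-ordered observation list per object."""
    tracks: dict = field(default_factory=dict)

    def integrate(self, token, pose):
        # Anchor the token in the world frame before storing it.
        world = to_world(pose, token.centroid)
        self.tracks.setdefault(token.object_id, []).append(
            (token.timestamp, token.label, world))

    def displacement(self, object_id):
        # World-frame distance between the first and last observation.
        obs = sorted(self.tracks[object_id], key=lambda o: o[0])
        return math.dist(obs[0][2], obs[-1][2])


graph = SceneGraph4D()
# A parked car observed twice while the sensor drives 2 m forward.
graph.integrate(StepToken(1, "car", (5.0, 0.0, 0.0), 0.0), Pose(0.0, 0.0, 0.0, 0.0))
graph.integrate(StepToken(1, "car", (3.0, 0.0, 0.0), 1.0), Pose(2.0, 0.0, 0.0, 0.0))
print(graph.displacement(1))
```

In the sensor frame the car appears to move 2 m closer, but once both observations are expressed in the world frame its displacement is zero. This is the kind of ambiguity that global anchoring is meant to remove.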
