We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as a first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-world indoor scenes show that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
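The hierarchical memory described above -- structural anchors as a first-layer scaffold, object nodes carrying evidence patches, embeddings, two-layer descriptions, and metric 3D coordinates -- can be sketched as a data structure. This is a minimal illustrative sketch under stated assumptions; all class names, fields, and the cosine-similarity retrieval are hypothetical and not the authors' actual implementation.

```python
# Hypothetical sketch of a SpatialMem-style two-layer memory.
# All names and fields here are illustrative assumptions.
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnchorNode:
    """First-layer structural scaffold: a wall, door, or window."""
    kind: str        # "wall" | "door" | "window"
    center: tuple    # metric 3D coordinates (x, y, z)

@dataclass
class ObjectNode:
    """Open-vocabulary object node linked to 3D coordinates and an anchor."""
    label: str                       # open-vocabulary class name
    position: tuple                  # metric 3D coordinates (x, y, z)
    embedding: list                  # visual embedding used for retrieval
    short_desc: str                  # layer-1: terse textual description
    long_desc: str                   # layer-2: detailed textual description
    anchor: Optional[AnchorNode] = None  # nearest structural anchor

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class SpatialMem:
    anchors: list = field(default_factory=list)
    objects: list = field(default_factory=list)

    def retrieve(self, query_emb, k=1):
        """Rank object nodes by embedding similarity to a query."""
        ranked = sorted(self.objects,
                        key=lambda o: cosine(o.embedding, query_emb),
                        reverse=True)
        return ranked[:k]

# Populate a toy memory and retrieve by a query embedding.
mem = SpatialMem()
door = AnchorNode("door", (0.0, 0.0, 0.0))
mem.anchors.append(door)
mem.objects.append(ObjectNode("red mug", (1.2, 0.4, 0.8), [1.0, 0.0],
                              "a mug", "a red ceramic mug on the desk", door))
mem.objects.append(ObjectNode("plant", (3.0, 0.1, 0.0), [0.0, 1.0],
                              "a plant", "a potted plant by the window", door))
best = mem.retrieve([0.9, 0.1])[0]
print(best.label)  # → red mug
```

In this sketch, each object node stays grounded in 3D space through its `position` and `anchor` link, so a retrieval hit can immediately feed spatial-relation reasoning (distance, direction) against the anchor scaffold.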