Robotic tasks such as planning and navigation require a hierarchical semantic understanding of a scene, which may span multiple floors and rooms. Current methods primarily focus on object segmentation for 3D scene understanding, but such methods struggle to segment out topological regions like a "kitchen" in the scene. In this work, we introduce a two-step pipeline to address this problem. First, we extract a topological map, i.e., a floorplan, of the indoor scene using a novel multi-channel occupancy representation. Then, using a self-attention transformer, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains. Our language-topology alignment supports natural language querying, e.g., a "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and on room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding. Project page: quest-maps.github.io
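At inference time, the natural-language querying described above reduces to a nearest-neighbor search between a text embedding of the query and the CLIP-aligned room features. A minimal sketch, assuming precomputed room features; the feature vectors and the query embedding below are hypothetical placeholders standing in for real CLIP encoder outputs:

```python
import numpy as np

def cosine_similarity(query, features):
    # Cosine similarity between a query vector and each row of a feature matrix.
    query = query / np.linalg.norm(query)
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return features @ query

# Hypothetical CLIP-aligned room features (one row per room instance).
# In the paper these come from a self-attention transformer over the
# objects each room contains; here they are fixed placeholders.
room_labels = ["kitchen", "bedroom", "bathroom"]
room_features = np.array([
    [0.9, 0.1, 0.0],   # "kitchen"-like feature
    [0.1, 0.8, 0.2],   # "bedroom"-like feature
    [0.0, 0.2, 0.9],   # "bathroom"-like feature
])

# Stand-in for a CLIP text encoding of the query "place to cook";
# in the shared embedding space it lands near the kitchen feature.
query_embedding = np.array([0.85, 0.15, 0.05])

scores = cosine_similarity(query_embedding, room_features)
best = room_labels[int(np.argmax(scores))]
print(best)  # -> kitchen
```

Because both the query and the room features live in CLIP's joint embedding space, retrieval is a single matrix-vector product followed by an argmax, with no task-specific training needed at query time.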