Effective scene representation is critical for 3D visual grounding, yet existing methods are often constrained: they either focus solely on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM), a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is constructed online by fusing multi-view observations within a temporal sliding window, yielding a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to structured reasoning over the semantically rich map, markedly improving accuracy and interpretability. Extensive evaluations validate the superiority of our approach. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art overall accuracy of 59.06% at IoU@0.5, surpassing other methods by 10 percentage points. In semantic segmentation, the DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.