Open-vocabulary scene understanding is crucial for robotic applications, enabling robots to comprehend complex 3D environmental contexts and supporting various downstream tasks such as navigation and manipulation. However, existing methods require pre-built complete 3D semantic maps to construct scene graphs for scene understanding, which limits their applicability in robotic scenarios where environments are explored incrementally. To address this challenge, we propose OGScene3D, an open-vocabulary scene understanding system that achieves accurate 3D semantic mapping and scene graph construction incrementally. Our system employs a confidence-based Gaussian semantic representation that jointly models semantic predictions and their reliability, enabling robust scene modeling. Building on this representation, we introduce a hierarchical 3D semantic optimization strategy that achieves semantic consistency through local correspondence establishment and global refinement, thereby constructing globally consistent semantic maps. Moreover, we design a long-term global optimization method that leverages temporal memory of historical observations to enhance semantic predictions. By integrating 2D-3D semantic consistency with Gaussian rendering contribution, this method continuously refines the semantic understanding of the entire scene.Furthermore, we develop a progressive graph construction approach that dynamically creates and updates both nodes and semantic relationships, allowing continuous updating of the 3D scene graphs. Extensive experiments on widely used datasets and real-world scenes demonstrate the effectiveness of our OGScene3D on open-vocabulary scene understanding.
翻译:开放词汇场景理解对于机器人应用至关重要,它使机器人能够理解复杂的三维环境上下文,并支持导航与操作等多种下游任务。然而,现有方法需要预先构建完整的三维语义地图以构建用于场景理解的场景图,这限制了它们在环境被增量式探索的机器人场景中的适用性。为解决这一挑战,我们提出了OGScene3D,一个开放词汇场景理解系统,能够增量式地实现精确的三维语义建图与场景图构建。我们的系统采用一种基于置信度的高斯语义表示,该表示联合建模语义预测及其可靠性,从而实现鲁棒的场景建模。基于此表示,我们引入了一种分层三维语义优化策略,该策略通过建立局部对应关系与全局细化来实现语义一致性,从而构建全局一致的语义地图。此外,我们设计了一种长期全局优化方法,该方法利用历史观测的时间记忆来增强语义预测。通过将2D-3D语义一致性与高斯渲染贡献相结合,此方法持续优化对整个场景的语义理解。进一步地,我们开发了一种渐进式图构建方法,能够动态创建并更新节点及语义关系,从而实现三维场景图的持续更新。在广泛使用的数据集和真实场景上进行的大量实验证明了我们的OGScene3D在开放词汇场景理解方面的有效性。