Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by providing structured semantic representations. A 3DSG is constructed from multiple images of a scene, with objects represented as nodes and relationships as edges. However, existing approaches to open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and slow generation speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG, which mitigates aggregation noise through re-shot guided uncertainty estimation and supports object-level Retrieval-Augmented Generation (RAG) using reliable, low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy that accelerates cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing mapping time by two-thirds compared to the vanilla version.