3D reconstruction has been widely used in autonomous navigation for mobile robots. However, prior research provides only the basic geometric structure of a scene, without the capability for open-world scene understanding, which limits advanced tasks such as human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a supervised model for a single task. Geometric reconstruction with zero-shot scene understanding, i.e., open-vocabulary 3D understanding and reconstruction, is therefore crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework that unifies 3D scene reconstruction and open-vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with an occupancy representation and distill a pre-trained open-vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, we propose a novel semantic-aware confidence propagation (SCP) method to alleviate the degeneracy of the language-field representation caused by inconsistent measurements in the distilled features. Experimental results show that our approach achieves competitive performance on 3D scene understanding tasks, especially for small and long-tail objects.
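To make the distillation step concrete, the following is a minimal sketch (not the authors' implementation) of how 2D open-vocabulary features can be distilled into a 3D language field via volume rendering: a field maps sample points to a density and a language-feature vector, per-point features are alpha-composited along each ray with standard NeRF weights, and the rendered ray feature is supervised by the 2D feature from a pre-trained open-vocabulary model. All module names, tensor shapes, and the L1 distillation loss here are illustrative assumptions.

```python
# Minimal sketch of language-feature distillation via volume rendering.
# Names, shapes, and the loss are assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class LanguageField(nn.Module):
    """Maps 3D sample points to volume density and a language-feature vector."""
    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)        # volume density
        self.feat_head = nn.Linear(hidden, feat_dim)  # per-point language feature

    def forward(self, pts: torch.Tensor):
        h = self.mlp(pts)                             # (rays, samples, hidden)
        sigma = torch.relu(self.sigma_head(h))        # (rays, samples, 1)
        feat = self.feat_head(h)                      # (rays, samples, feat_dim)
        return sigma, feat

def render_features(sigma, feat, deltas):
    """Alpha-composite per-point features along each ray (standard NeRF weights)."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)           # (rays, samples)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                            # transmittance
    weights = alpha * trans                                        # (rays, samples)
    return (weights.unsqueeze(-1) * feat).sum(dim=1)               # (rays, feat_dim)

# One distillation step: supervise rendered ray features with 2D features
# produced by a pre-trained open-vocabulary model (placeholder tensors below).
field = LanguageField()
optim = torch.optim.Adam(field.parameters(), lr=1e-3)

rays_pts = torch.randn(1024, 64, 3)      # (rays, samples, xyz), placeholder
deltas = torch.full((1024, 64), 0.02)    # sample spacing along each ray
target_feat = torch.randn(1024, 512)     # 2D distillation targets (assumed given)

sigma, feat = field(rays_pts)
pred = render_features(sigma, feat, deltas)
loss = torch.nn.functional.l1_loss(pred, target_feat)  # distillation loss (assumed L1)
loss.backward()
optim.step()
```

At inference, zero-shot queries can then be answered by comparing rendered or per-point features against text embeddings from the same open-vocabulary model; the SCP step described above would additionally reweight inconsistent feature measurements, which this sketch omits.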