3D reconstruction is widely used for autonomous navigation in mobile robotics. However, prior work recovers only basic geometric structure and lacks open-world scene understanding, which limits advanced tasks such as human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a supervised model for a single task. Geometric reconstruction with zero-shot scene understanding, i.e., open-vocabulary 3D understanding and reconstruction, is therefore crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework that unifies 3D scene reconstruction and open-vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with an occupancy representation and distill a pre-trained open-vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, we propose a novel semantic-aware confidence propagation (SCP) method to mitigate the degeneracy of the language field representation caused by inconsistent measurements in the distilled features. Experimental results show that our approach achieves competitive performance on 3D scene understanding tasks, especially for small and long-tail objects.
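To make the idea of rendering a language field through an occupancy representation more concrete, the following is a minimal sketch, not the OpenOcc implementation. It assumes a common occupancy-based rendering rule in which the weight of sample i along a ray is its occupancy times the probability that all nearer samples are free, and it assumes zero-shot labels are picked by cosine similarity against text embeddings. The function names (`render_weights`, `render_feature`, `zero_shot_label`) and the toy 3-dimensional features are hypothetical and chosen only for illustration; the SCP method is not modeled here.

```python
import numpy as np


def render_weights(occupancy):
    """Per-sample rendering weights for one ray (assumed formulation).

    occupancy: (N,) values in [0, 1], samples ordered from near to far.
    w_i = o_i * prod_{j < i} (1 - o_j)
    """
    occupancy = np.asarray(occupancy, dtype=np.float64)
    # Probability that the ray reaches sample i without hitting anything earlier.
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - occupancy)[:-1]))
    return occupancy * free_before


def render_feature(occupancy, features):
    """Weighted sum of per-sample language features along the ray.

    features: (N, D) array, one D-dimensional feature per sample.
    """
    weights = render_weights(occupancy)
    return weights @ np.asarray(features, dtype=np.float64)


def zero_shot_label(rendered_feature, text_embeddings):
    """Index of the text embedding with the highest cosine similarity."""
    f = np.asarray(rendered_feature, dtype=np.float64)
    t = np.asarray(text_embeddings, dtype=np.float64)
    f = f / (np.linalg.norm(f) + 1e-8)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(t @ f))


# Toy ray: free space first, then a surface near the third sample.
occupancy = [0.0, 0.1, 0.9, 1.0]
# Toy per-sample features (hypothetical 3-D "language" features).
features = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
# Toy text embeddings for three hypothetical class prompts.
text_embeddings = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

weights = render_weights(occupancy)
rendered = render_feature(occupancy, features)
label = zero_shot_label(rendered, text_embeddings)
```

In this toy ray most of the weight falls on the samples at and behind the surface, so the rendered feature is dominated by the second feature direction and the second prompt is selected. In a distillation setup, a rendered feature of this kind would be compared against the 2D feature of a pre-trained open-vocabulary model at the corresponding pixel; the exact loss is not specified in the abstract and is left out here.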