OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

Neural Radiance Fields (NeRFs) provide a powerful representation of the geometry and appearance of 3D scenes, and extending them to open-vocabulary 3D semantic perception has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning struggle because the semantics CLIP provides are noisy and inconsistent across views. To address these limitations, we propose OV-NeRF, which exploits pre-trained vision and language foundation models to enhance semantic field learning through single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization, which uses 2D mask proposals from the Segment Anything Model (SAM) to rectify the noisy semantics of each training view, enabling accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to handle view-inconsistent semantics. Rather than always using the inconsistent 2D semantics from CLIP, CSE trains the semantic field on the 3D-consistent semantics generated by the well-trained semantic field itself, reducing ambiguity and improving semantic consistency across views. Extensive experiments show that OV-NeRF outperforms current state-of-the-art methods, improving mIoU by 20.31% and 18.42% on Replica and ScanNet, respectively. Furthermore, our approach achieves consistently superior results across various CLIP configurations, further verifying its robustness.
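
The single-view idea, using region proposals to clean up noisy per-pixel semantics, can be illustrated with a toy sketch. This is not the RSR formulation itself, whose details the abstract does not give. It assumes, purely for illustration, that every pixel in a mask region adopts the class with the highest summed relevancy over that region. The function names `rectify_with_regions` and `mean_iou`, the score layout, and the 4x4 example data are all hypothetical.

```python
def rectify_with_regions(scores, regions):
    """Toy region-level label rectification (an assumption, not the paper's RSR).

    scores[r][c] is a list of per-class relevancy values for pixel (r, c).
    regions is a list of pixel-coordinate lists, one per mask proposal.
    Returns a 2D list of class indices.
    """
    def argmax(values):
        return max(range(len(values)), key=lambda k: values[k])

    # Start from the noisy per-pixel prediction.
    labels = [[argmax(px) for px in row] for row in scores]
    num_classes = len(scores[0][0])

    for region in regions:
        if not region:
            continue
        # Rank classes by their total relevancy inside the region.
        totals = [0.0] * num_classes
        for r, c in region:
            for k in range(num_classes):
                totals[k] += scores[r][c][k]
        best = argmax(totals)
        # Every pixel of the region adopts the top-ranked class.
        # If regions overlap, later regions overwrite earlier ones.
        for r, c in region:
            labels[r][c] = best
    return labels


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for k in range(num_classes):
        inter = union = 0
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                inter += (p == k and g == k)
                union += (p == k or g == k)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 4x4 view: left half is class 0, right half is class 1.
H, W = 4, 4
gt = [[0 if c < 2 else 1 for c in range(W)] for _ in range(H)]
scores = [[[0.9, 0.1] if c < 2 else [0.1, 0.9] for c in range(W)]
          for _ in range(H)]
scores[0][0] = [0.2, 0.8]  # one noisy pixel inside the class-0 region

regions = [
    [(r, c) for r in range(H) for c in range(2)],     # left-half proposal
    [(r, c) for r in range(H) for c in range(2, W)],  # right-half proposal
]

noisy = rectify_with_regions(scores, [])       # no regions: raw argmax
clean = rectify_with_regions(scores, regions)  # region-level rectification
print(mean_iou(noisy, gt, 2), mean_iou(clean, gt, 2))
```

In this toy case the single mislabeled pixel is outvoted by the rest of its region, so the rectified labels match the ground truth and the mIoU rises to 1.0. Real mask proposals and CLIP relevancy maps are far noisier, which is why the paper treats this step as a regularization during training rather than a one-shot post-process.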
