Applying NeRF to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task, \textit{i.e.}, the "label rendering" task, to build semantic NeRFs. However, because they render semantic/instance labels per pixel without considering the contextual information of the rendered image, these methods usually suffer from unclear segmentation boundaries and erroneous segmentation of pixels within an object. To solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel pipeline that makes widely used segmentation models and NeRF work together under a unified framework to facilitate context-aware 3D scene perception. To this end, we introduce transformers that jointly aggregate the radiance field and the semantic embedding field for novel views, enabling joint volumetric rendering of both fields. In addition, we propose two self-distillation mechanisms, \textit{i.e.}, the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field and to maintain geometric consistency. We evaluate on two perception tasks (\textit{i.e.}, semantic and instance segmentation) using both synthetic and real-world datasets. Notably, our method outperforms SOTA approaches by 6.94\%, 11.76\%, and 8.47\% on generalized semantic segmentation, finetuning semantic segmentation, and instance segmentation, respectively.
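To illustrate what joint volumetric rendering of a radiance field and a semantic embedding field can look like, the following minimal sketch composites both quantities along a single ray with shared weights. It assumes the classic NeRF quadrature rule (per-sample opacity from density, accumulated transmittance); GP-NeRF aggregates with transformers, so this is not the authors' implementation. The function name `render_ray` and all of its arguments are hypothetical.

```python
import math


def render_ray(sigmas, deltas, colors, sem_feats):
    """Composite colors and semantic embeddings along one ray.

    Assumed (hypothetical) inputs, one entry per sample on the ray:
      sigmas    -- volume densities
      deltas    -- distances between adjacent samples
      colors    -- RGB triples
      sem_feats -- semantic embedding vectors
    Returns (rgb, semantic_embedding, weights).
    """
    transmittance = 1.0
    weights = []
    for sigma, delta in zip(sigmas, deltas):
        # Opacity of this sample under the standard NeRF quadrature.
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        # Light that survives past this sample.
        transmittance *= 1.0 - alpha

    def composite(values):
        # Weighted sum over samples, applied per channel.
        dim = len(values[0])
        return [sum(w * v[k] for w, v in zip(weights, values))
                for k in range(dim)]

    # The same weights drive both fields, so color and semantics
    # are tied to the same underlying geometry.
    return composite(colors), composite(sem_feats), weights
```

Because one set of weights is shared, the rendered semantic embedding stays aligned with the rendered color; a per-pixel embedding map produced this way could then be passed to a 2D segmentation head that uses image context.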