Large visual-language models (VLMs) such as CLIP enable open-set image segmentation, in which arbitrary concepts are segmented from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, under which models can only segment classes from a pre-defined training set. More recently, the first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images, and the reconstructed 3D scene geometry might not project well onto the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
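To make the evaluation setting concrete, the following is a minimal sketch of how open-vocabulary point labeling and the mIoU metric could be computed once every 3D point carries a VLM feature. It is not the OpenNeRF implementation: the function names `segment_open_vocabulary` and `mean_iou` are hypothetical, the tiny hand-made vectors stand in for real per-point and text-prompt embeddings, and all feature vectors are assumed to be non-zero.

```python
import numpy as np


def segment_open_vocabulary(point_features, text_embeddings):
    """Label each point with the index of its most similar text prompt.

    point_features: (N, D) array, one feature per 3D point (assumed non-zero).
    text_embeddings: (C, D) array, one embedding per class prompt (assumed non-zero).
    """
    # Normalize rows so that the dot product equals cosine similarity.
    p = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return np.argmax(p @ t.T, axis=1)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        union = np.sum((pred == c) | (target == c))
        if union == 0:
            continue  # class appears in neither prediction nor ground truth
        inter = np.sum((pred == c) & (target == c))
        ious.append(inter / union)
    return float(np.mean(ious))


# Toy stand-ins for text-prompt embeddings (3 classes) and per-point features.
text_embeddings = np.eye(3)
point_features = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 2.0],
    [0.6, 0.5, 0.0],
])
ground_truth = np.array([0, 1, 2, 1])

labels = segment_open_vocabulary(point_features, text_embeddings)
score = mean_iou(labels, ground_truth, num_classes=3)
```

In this sketch the class set is defined only by the text prompts passed at query time, which is what distinguishes the open-set setting from a closed-set classifier with a fixed output layer.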