Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods can segment scenes into arbitrary classes specified at run-time by their text descriptions. In this paper, we propose what is, to our knowledge, the first algorithm for open-vocabulary panoptic segmentation, which performs semantic and instance segmentation simultaneously. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a feature field of the scene, jointly learning vision-language features and hierarchical instance features through a contrastive loss function computed from 2D instance segment proposals on the input frames. Our method achieves performance comparable to state-of-the-art closed-set 3D panoptic systems on the HyperSim, ScanNet and Replica datasets, and outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We additionally ablate our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/autolabel.
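To illustrate what segmenting into classes given at run-time by text can look like, the following is a minimal sketch and not the PVLFF implementation. It assumes that each 3D point already carries a feature vector living in the same embedding space as the text prompts, and that a point is labeled with the prompt of highest cosine similarity. The names `cosine` and `assign_labels` and the toy 3-dimensional embeddings are hypothetical; in practice the embeddings would come from a vision-language model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def assign_labels(point_features, text_embeddings):
    """Label each feature with the prompt whose embedding is most similar to it."""
    prompts = list(text_embeddings)
    labels = []
    for feature in point_features:
        scores = [cosine(feature, text_embeddings[p]) for p in prompts]
        labels.append(prompts[scores.index(max(scores))])
    return labels


# Toy stand-ins (assumption): real embeddings would be high-dimensional
# outputs of a text encoder, not hand-written 3-vectors.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
point_features = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 2.0],
]

labels = assign_labels(point_features, text_embeddings)
print(labels)  # ['chair', 'table', 'lamp']
```

Because the class list is only a dictionary of prompt embeddings, it can be changed at query time without retraining the features, which is the property that distinguishes open-vocabulary from closed-set segmentation.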