We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact, 3D-consistent scene segmentation at fast rendering speed that takes only RGB images as input. Previous NeRF-based segmentation methods rely on time-consuming neural scene optimization. Although recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, which leaves them without robustness to cross-view inconsistent, machine-generated 2D labels. Our method addresses this problem by employing a Dual Feature Fusion Network as the Gaussians' segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from the images are applied to the Gaussians through explicit unprojection and are then combined with spatial features from an efficient point cloud processing network. Feature aggregation then fuses the two in a global-to-local strategy to produce compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation tasks while consuming less than 10% of the inference time of NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians.
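To make the explicit unprojection step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera, nearest-pixel sampling, no occlusion test, and simple averaging across views. The names `project_point` and `unproject_features` and the `views` dictionary layout are hypothetical and chosen only for illustration. The later fusion with spatial features from the point cloud network is not covered.

```python
from typing import List, Optional, Sequence, Tuple


def project_point(
    p: Sequence[float],
    R: Sequence[Sequence[float]],
    t: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float, float]]:
    """Project a world-space point with a pinhole camera (assumed model).

    Returns (u, v, depth), or None if the point lies behind the camera.
    """
    # Camera-space coordinates: x_cam = R @ p + t
    xc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v, xc[2]


def unproject_features(centers, views) -> List[Optional[List[float]]]:
    """Assign each Gaussian center the mean of the 2D features it projects onto.

    `views` is a list of dicts with keys "R", "t", "intrinsics" (fx, fy, cx, cy)
    and "features" (an H x W x C nested list). This layout is an assumption.
    Centers that are visible in no view get None.
    """
    result = []
    for p in centers:
        acc, count = None, 0
        for view in views:
            proj = project_point(p, view["R"], view["t"], *view["intrinsics"])
            if proj is None:
                continue
            # Nearest-pixel lookup; a real pipeline would likely interpolate.
            col, row = int(round(proj[0])), int(round(proj[1]))
            fmap = view["features"]
            if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
                feat = fmap[row][col]
                acc = list(feat) if acc is None else [a + b for a, b in zip(acc, feat)]
                count += 1
        result.append([a / count for a in acc] if count else None)
    return result
```

Averaging over views is one simple way to damp the cross-view inconsistency the abstract mentions. A practical version would also need a visibility check so that occluded Gaussians do not pick up features from surfaces in front of them.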