We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact, 3D-consistent scene segmentation at fast rendering speed that takes only RGB images as input. Previous NeRF-based segmentation methods rely on time-consuming neural scene optimization. Although recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, which leaves them with little robustness against cross-view-inconsistent, machine-generated 2D labels. Our method addresses this problem by employing a Dual Feature Fusion Network as the Gaussians' segmentation field. Specifically, we first optimize the 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from the images are applied to the Gaussians through explicit unprojection and are further combined with spatial features from an efficient point cloud processing network. Feature aggregation then fuses them in a global-to-local strategy to obtain compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation tasks while consuming less than 10% of the inference time of NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians
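The explicit unprojection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `unproject_features`, the pinhole camera model, the nearest-pixel lookup, and the plain averaging across views are all assumptions made for illustration, and occlusion is ignored. It only shows the general idea of projecting each Gaussian center into every view and collecting the 2D feature found there.

```python
import numpy as np


def unproject_features(centers, feat_maps, intrinsics, extrinsics):
    """Average per-view 2D features onto 3D points (hypothetical helper).

    centers:    (N, 3) world-space Gaussian centers
    feat_maps:  list of (H, W, C) feature maps, one per view (e.g. DINO features)
    intrinsics: list of (3, 3) pinhole matrices, scaled to feature-map resolution
    extrinsics: list of (4, 4) world-to-camera matrices
    Returns (features, seen): (N, C) averaged features and an (N,) bool mask
    marking points that landed inside at least one view.
    """
    n = centers.shape[0]
    c = feat_maps[0].shape[2]
    acc = np.zeros((n, c), dtype=np.float64)
    hits = np.zeros(n, dtype=np.int64)
    homo = np.concatenate([centers, np.ones((n, 1))], axis=1)  # (N, 4)

    for fmap, K, w2c in zip(feat_maps, intrinsics, extrinsics):
        h, w, _ = fmap.shape
        cam = (w2c @ homo.T).T[:, :3]            # points in camera space
        z = cam[:, 2]
        in_front = z > 1e-6                      # discard points behind the camera
        safe_z = np.where(in_front, z, 1.0)      # avoid division by zero
        pix = (K @ cam.T).T                      # homogeneous pixel coordinates
        u = np.rint(pix[:, 0] / safe_z).astype(np.int64)
        v = np.rint(pix[:, 1] / safe_z).astype(np.int64)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[valid] += fmap[v[valid], u[valid]]   # nearest-pixel feature lookup
        hits[valid] += 1

    features = np.zeros_like(acc)
    seen = hits > 0
    features[seen] = acc[seen] / hits[seen, None]
    return features, seen
```

In a pipeline like the one the abstract describes, the resulting per-Gaussian features would then be fused with spatial features from a point cloud network. A practical version would likely also need visibility handling and bilinear sampling, both of which this sketch omits.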