We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact, 3D-consistent scene segmentation at fast rendering speed that takes only RGB images as input. Previous NeRF-based 3D segmentation methods rely on implicit or voxel-based neural scene representations and ray-marching volume rendering, which are time-consuming. Recent 3D Gaussian Splatting significantly improves rendering speed; however, existing Gaussian-based segmentation methods (e.g., Gaussian Grouping) fail to provide compact segmentation masks, especially in zero-shot segmentation. This is mainly because directly assigning learnable parameters to each Gaussian lacks robustness and compactness when the 2D machine-generated labels are inconsistent. Our method aims to achieve compact and reliable zero-shot scene segmentation swiftly by mapping fused spatial and semantically meaningful features of each Gaussian point through a shallow decoding network. Specifically, our method first optimizes the position, covariance, and color attributes of the Gaussian points under the supervision of RGB images. After this Gaussian Locating stage, we distill multi-scale DINO features extracted from the images onto each Gaussian through unprojection, and then combine them with spatial features from a fast point-feature processing network, i.e., RandLA-Net. A shallow decoding MLP is then applied to the multi-scale fused features to obtain compact segmentation. Experimental results show that our model performs high-quality zero-shot scene segmentation: it outperforms other segmentation methods on both semantic and panoptic segmentation tasks while requiring only about 10% of the segmentation time of NeRF-based segmentation. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians
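The unprojection and decoding steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject_features` and `decode_labels`, the pinhole camera model, the nearest-pixel lookup, the plain averaging across views, and the two-layer MLP shape are all assumptions made for illustration. Occlusion handling, the multi-scale DINO extractor, and RandLA-Net are omitted; the spatial features are represented by an arbitrary placeholder array.

```python
import numpy as np

def unproject_features(centers, feat_maps, K, world_to_cams):
    """Average per-view 2D features onto 3D points (hypothetical helper).

    centers:       (N, 3) Gaussian centers in world coordinates
    feat_maps:     (V, H, W, C) per-view feature maps (stand-in for DINO features)
    K:             (3, 3) pinhole intrinsics at feature-map resolution (assumed)
    world_to_cams: (V, 4, 4) world-to-camera extrinsics
    """
    V, H, W, C = feat_maps.shape
    N = centers.shape[0]
    acc = np.zeros((N, C))
    hits = np.zeros(N)
    homog = np.concatenate([centers, np.ones((N, 1))], axis=1)  # (N, 4)
    for v in range(V):
        cam = (world_to_cams[v] @ homog.T).T[:, :3]  # points in camera frame
        z = cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)          # avoid division by zero
        pix = (K @ cam.T).T
        col = np.round(pix[:, 0] / safe_z).astype(int)
        row = np.round(pix[:, 1] / safe_z).astype(int)
        # Visibility here means "in front of the camera and inside the image";
        # occlusion is ignored in this sketch.
        visible = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
        acc[visible] += feat_maps[v, row[visible], col[visible]]
        hits[visible] += 1
    # Points seen by no view keep an all-zero feature vector.
    return acc / np.maximum(hits, 1)[:, None]

def decode_labels(fused, W1, b1, W2, b2):
    """Shallow two-layer MLP with softmax over classes (assumed architecture)."""
    hidden = np.maximum(fused @ W1 + b1, 0.0)             # ReLU
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(5, 3)) + np.array([0.0, 0.0, 3.0])
feat_maps = rng.normal(size=(2, 8, 8, 4))
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
extrinsics = np.stack([np.eye(4), np.eye(4)])
semantic = unproject_features(centers, feat_maps, K, extrinsics)
spatial = rng.normal(size=(5, 3))   # placeholder for point-network features
fused = np.concatenate([semantic, spatial], axis=1)
probs = decode_labels(fused, rng.normal(size=(7, 16)), np.zeros(16),
                      rng.normal(size=(16, 3)), np.zeros(3))
labels = probs.argmax(axis=1)       # one class index per Gaussian
```

In this sketch the decoder is shared across all Gaussians and operates on fused features, in contrast to assigning an independent learnable label parameter to each Gaussian, which is the design the abstract argues against.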