Current 3D scene segmentation methods depend heavily on manually annotated 3D training datasets. Such manual annotations are labor-intensive to produce and often lack fine-grained detail. Importantly, models trained on this data typically struggle to recognize object classes beyond those annotated, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, 2D foundation models demonstrate strong generalization and impressive zero-shot abilities, which inspires us to transfer these characteristics from 2D models to 3D models. We therefore explore the use of image segmentation foundation models to automatically generate training labels for 3D segmentation. We propose Segment3D, a method for class-agnostic 3D scene segmentation that produces high-quality 3D segmentation masks. It improves over existing 3D segmentation models, especially on fine-grained masks, and makes it easy to add new training data to further boost segmentation performance, all without the need for manual training labels.