In this paper, we propose a training scheme called OVSeg3R that learns open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly takes scenes reconstructed from 2D videos as input, avoiding costly manual adjustment and aligning the input with real-world applications. Exploiting the 2D-to-3D correspondences provided by the reconstruction model, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, into 3D to generate annotations for the view's corresponding sub-scene. Because these lifted annotations cover only the part of the scene visible in each view, naively supervising the full scene with them would penalize correct predictions as false positives; we therefore propose a View-wise Instance Partition algorithm that assigns predictions to their respective views for supervision, stabilizing training. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative superpoints based solely on geometry, as mainstream 3D segmentation methods commonly do, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoints, which use 2D masks to constrain superpoint clustering so that superpoints do not cross instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, yielding an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on novel classes, further validating its effectiveness.
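The mask-lifting step can be illustrated with a minimal sketch. Assuming the reconstruction model exposes, for each view, a per-pixel map to reconstructed 3D point indices (the function name `lift_masks_to_3d` and the array layouts are hypothetical, not from the paper), each 2D instance mask is projected onto the points it covers to form the view's sub-scene annotation:

```python
import numpy as np

def lift_masks_to_3d(masks_2d, pix2point, num_points):
    """Lift per-view 2D instance masks to 3D point-level labels.

    masks_2d:  (K, H, W) boolean masks predicted by a 2D model for one view.
    pix2point: (H, W) int array mapping each pixel to the index of its
               reconstructed 3D point, or -1 where no correspondence exists.
    Returns a list of K boolean arrays over the scene's points, i.e. the
    annotation for the view's corresponding sub-scene.
    """
    labels_3d = []
    for mask in masks_2d:
        point_ids = pix2point[mask]            # 3D indices covered by the mask
        point_ids = point_ids[point_ids >= 0]  # drop pixels with no match
        lab = np.zeros(num_points, dtype=bool)
        lab[np.unique(point_ids)] = True
        labels_3d.append(lab)
    return labels_3d
```

Points outside the view's frustum never appear in `pix2point`, which is exactly why the resulting labels are partial and why supervision must be restricted to each view's own sub-scene.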
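The boundary constraint on superpoint clustering can likewise be sketched. Assuming geometric clustering is expressed as merging along point-adjacency edges (e.g. between points with similar normals), a boundary-aware variant simply refuses to merge across points carrying different lifted 2D instance ids; the union-find formulation below is an illustrative assumption, not the paper's exact algorithm:

```python
class UnionFind:
    """Minimal disjoint-set structure with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def boundary_aware_superpoints(edges, point_instance, num_points):
    """Cluster points into superpoints without crossing instance boundaries.

    edges:          (i, j) pairs deemed mergeable by geometry alone.
    point_instance: per-point instance id lifted from 2D masks (-1 = unlabeled).
    Returns a superpoint id per point.
    """
    uf = UnionFind(num_points)
    for i, j in edges:
        li, lj = point_instance[i], point_instance[j]
        # merge only when labels agree or one side is unlabeled
        if li == lj or li == -1 or lj == -1:
            uf.union(i, j)
    return [uf.find(i) for i in range(num_points)]
```

Under this constraint, a geometrically smooth region spanning two objects still splits into separate superpoints wherever the lifted 2D masks disagree, which is how geometrically non-salient objects survive the clustering.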