OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction

In this paper, we propose OVSeg3R, a training scheme that learns open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R takes scenes reconstructed from 2D videos directly as input, which avoids costly manual adjustment and aligns the input with real-world applications. Using the 2D-to-3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, into 3D to generate annotations for that view's corresponding sub-scene. Because annotations lifted from 2D to 3D are only partial, naive supervision can incorrectly treat predictions as false positives. To avoid this, we propose a View-wise Instance Partition algorithm, which partitions predictions into their respective views for supervision and thereby stabilizes training. Furthermore, 3D reconstruction models tend to over-smooth geometric details, so clustering reconstructed points into representative superpoints based solely on geometry, as mainstream 3D segmentation methods commonly do, may overlook geometrically non-salient objects. We therefore introduce the 2D Instance Boundary-aware Superpoint, which uses 2D masks to constrain superpoint clustering and prevents superpoints from crossing instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Under the standard open-vocabulary setting, OVSeg3R also surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
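To make two of the ideas above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each reconstructed point of a view comes with the pixel it was reconstructed from, and that the 2D model's output is an integer instance map with 0 meaning "no instance". The function names, the data layout, and the toy inputs are all hypothetical. The second function is a simplification: it splits geometric superpoints after the fact wherever they mix 2D instance ids, whereas the abstract describes constraining the clustering itself.

```python
def lift_masks_to_points(mask_2d, pixel_uv):
    """Give each reconstructed point the 2D instance id found at its source pixel.

    mask_2d  : list of rows of ints, 0 = no instance, k > 0 = instance k (assumed layout)
    pixel_uv : list of (row, col) pairs, one per 3D point of this view
    """
    return [mask_2d[r][c] for r, c in pixel_uv]


def split_superpoints_by_instance(superpoint_ids, instance_ids):
    """Split geometric superpoints so that no superpoint mixes 2D instance ids.

    Each distinct (superpoint id, instance id) pair becomes a new superpoint,
    numbered in order of first appearance.
    """
    remap = {}
    refined = []
    for sp, inst in zip(superpoint_ids, instance_ids):
        key = (sp, inst)
        if key not in remap:
            remap[key] = len(remap)
        refined.append(remap[key])
    return refined


# Toy view: a 2x3 instance map with two objects (ids 1 and 2) and background (0).
mask_2d = [
    [1, 1, 0],
    [2, 2, 0],
]
# Five reconstructed points and the pixels they were reconstructed from.
pixel_uv = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2)]
# Geometry-only clustering merged both objects into superpoint 0,
# e.g. two flat objects lying on an over-smoothed surface.
geometric_superpoints = [0, 0, 0, 0, 1]

point_instances = lift_masks_to_points(mask_2d, pixel_uv)
refined_superpoints = split_superpoints_by_instance(geometric_superpoints, point_instances)

print(point_instances)      # [1, 1, 2, 2, 0]
print(refined_superpoints)  # [0, 0, 1, 1, 2]
```

In the toy input, geometry alone places both objects in one superpoint, so a segmentation model operating on superpoints could not separate them. After the split, each object has its own superpoint and the background point keeps a separate one.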
