Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference and high computational requirements. This cost typically stems from their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models such as Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. This hampers their applicability in real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already carries instance information; thus, using SAM may only add redundancy that unnecessarily increases inference time. We empirically find that matching text prompts to 3D masks with a 2D object detector is both faster and more accurate. We validate Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground-truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated by a 3D proposal network. Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup over the best existing method in the literature. On the ScanNet200 validation set, Open-YOLO 3D achieves a mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.
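As a rough illustration of the pipeline described above, the sketch below projects the points of one class-agnostic 3D instance into each posed RGB view and accumulates class votes over the boxes produced by a 2D open-vocabulary detector. This is a minimal sketch, not the authors' implementation: all function names, the vote weighting, and the data layouts (per-view intrinsics/extrinsics and detection tuples) are illustrative assumptions.

\begin{verbatim}
# Minimal sketch: label a class-agnostic 3D instance by projecting its
# points into multi-view RGB frames and voting over 2D open-vocabulary
# detections. Names and scoring are illustrative assumptions.
import numpy as np

def project_points(points_3d, intrinsics, world_to_cam):
    """Project Nx3 world points to pixel coords; drop points behind the camera."""
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # Nx4
    cam = (world_to_cam @ homo.T).T[:, :3]                       # Nx3 camera coords
    cam = cam[cam[:, 2] > 1e-6]            # keep only points in front of the camera
    pix = (intrinsics @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]        # perspective divide -> Mx2 pixels

def label_instance(points_3d, views, detections, num_classes):
    """Accumulate class votes for one 3D instance across all views.

    views:      list of (intrinsics 3x3, world_to_cam 4x4) per RGB frame
    detections: per view, list of ((x1, y1, x2, y2), class_id, score)
                from a 2D open-vocabulary detector prompted with text.
    """
    votes = np.zeros(num_classes)
    for (K, T), dets in zip(views, detections):
        pix = project_points(points_3d, K, T)
        if len(pix) == 0:
            continue
        for (x1, y1, x2, y2), cls, score in dets:
            inside = ((pix[:, 0] >= x1) & (pix[:, 0] <= x2) &
                      (pix[:, 1] >= y1) & (pix[:, 1] <= y2))
            # weight the vote by detector confidence and point coverage
            votes[cls] += score * inside.mean()
    return int(votes.argmax())   # text prompt (class) assigned to this 3D mask
\end{verbatim}

Weighting each vote by detector confidence and by the fraction of projected points covered by the box is one plausible scoring choice for this sketch; note that no SAM or per-pixel CLIP features are needed, which is the source of the claimed speedup.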