The recent success of vision foundation models has shown promising performance on 2D perception tasks. However, it is difficult to train a 3D foundation network directly due to limited datasets, and it remains underexplored whether existing foundation models can be lifted to 3D space seamlessly. In this paper, we present PointSeg, a novel training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks. PointSeg can segment anything in 3D scenes by acquiring accurate 3D prompts to align their corresponding pixels across frames. Concretely, we design a two-branch prompt learning structure to construct 3D point-box prompt pairs, combined with a bidirectional matching strategy for accurate point and proposal prompt generation. Then, we perform iterative post-refinement adaptively when cooperating with different vision foundation models. Moreover, we design an affinity-aware merging algorithm to improve the final ensemble masks. PointSeg demonstrates impressive segmentation performance across various datasets, all without training. Specifically, our approach significantly surpasses the state-of-the-art specialist training-free model by 14.1$\%$, 12.3$\%$, and 12.6$\%$ mAP on the ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top of that, PointSeg can incorporate various foundation models and even surpasses specialist training-based methods by 3.4$\%$-5.4$\%$ mAP across various datasets, serving as an effective generalist model.
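To make the bidirectional matching idea concrete, here is a minimal sketch assuming it reduces to mutual-nearest-neighbor pairing over a similarity matrix between projected 3D point prompts and 2D box proposals. The names `sim` and `bidirectional_match` are illustrative assumptions, not the paper's actual API, and the real method may use a different scoring and matching criterion.

```python
import numpy as np

def bidirectional_match(sim: np.ndarray) -> list[tuple[int, int]]:
    """Keep only mutually-best (point prompt, box proposal) pairs.

    sim[i, j] is an assumed similarity score between 3D point prompt i
    (projected into the frame) and 2D box proposal j.
    """
    best_box_for_point = sim.argmax(axis=1)  # each point's favorite box
    best_point_for_box = sim.argmax(axis=0)  # each box's favorite point
    return [
        (i, j)
        for i, j in enumerate(best_box_for_point)
        if best_point_for_box[j] == i  # keep the pair only on mutual agreement
    ]

# Toy usage: 3 point prompts vs. 2 box proposals.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.7, 0.3]])
print(bidirectional_match(sim))  # [(0, 0), (1, 1)]; point 2 has no mutual match
```

The mutual-agreement test is what makes the matching bidirectional: a pair survives only if the point prefers the box and the box prefers the point, which filters out one-sided, ambiguous assignments.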