The Segment Anything Model (SAM) has proven effective at segmenting arbitrary objects and parts in 2D images, yet its capability in 3D has not been fully explored. The real world is composed of numerous 3D scenes and objects. Because accessible 3D data is scarce and costly to acquire and annotate, lifting SAM to 3D is a challenging but valuable research avenue. With this in mind, we propose a novel framework to Segment Anything in 3D, named SA3D. Given a neural radiance field (NeRF) model, SA3D allows users to obtain the 3D segmentation of any target object from one-shot manual prompting in a single rendered view. Given the input prompts, SAM cuts out the target object in the corresponding view. The resulting 2D segmentation mask is projected onto 3D mask grids via density-guided inverse rendering. 2D masks are then rendered from other views; these are mostly incomplete, but they serve as cross-view self-prompts that are fed back into SAM. The resulting complete masks are in turn projected onto the mask grids. This procedure is executed iteratively until accurate 3D masks are learned. SA3D adapts to various radiance fields without any additional redesign. The entire segmentation process completes in approximately two minutes without any engineering optimization. Our experiments demonstrate the effectiveness of SA3D in different scenes, highlighting the potential of SAM in 3D scene perception. The project page is at https://jumpat.github.io/SA3D/.
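The mask projection step can be illustrated on a single camera ray. The following is a minimal sketch, not the SA3D implementation: it assumes the standard NeRF volume-rendering weights, a flattened 1D mask grid, and hypothetical helper names (`render_weights`, `render_mask`, `project_mask`). Each sample along a ray whose pixel lies inside the SAM mask receives an update proportional to its rendering weight, so the mask value concentrates where the density places the visible surface.

```python
import numpy as np

def render_weights(sigma, delta):
    """Standard volume-rendering weights for the samples along one ray.

    sigma: densities at the samples; delta: spacing between samples.
    """
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each sample
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    return trans * alpha

def render_mask(weights, voxel_ids, mask_grid):
    """Render the 2D mask value of one pixel from the 3D mask grid."""
    return float(np.sum(weights * mask_grid[voxel_ids]))

def project_mask(weights, voxel_ids, mask_grid, pixel_in_mask, step=1.0):
    """Push a 2D mask pixel back onto the voxels its ray passes through.

    Assumption: only pixels inside the 2D mask update the grid here; any
    penalty for pixels outside the mask is left out of this sketch.
    """
    if pixel_in_mask:
        # Unbuffered add, so repeated voxel ids accumulate correctly.
        np.add.at(mask_grid, voxel_ids, step * weights)
    return mask_grid

if __name__ == "__main__":
    # Toy ray: empty space, then a dense surface at voxels 2 and 3.
    sigma = np.array([0.0, 0.0, 50.0, 50.0, 0.0])
    weights = render_weights(sigma, delta=0.1)
    grid = project_mask(weights, np.arange(5), np.zeros(5), pixel_in_mask=True)
    print(grid, render_mask(weights, np.arange(5), grid))
```

In this toy ray almost all of the update lands on the first dense voxel, since the samples behind it are occluded. Rendering the grid from another view with `render_mask` yields the kind of partial 2D mask that the abstract describes as a cross-view self-prompt.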