Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution that leverages the Neural Radiance Field (NeRF) as a cheap, off-the-shelf prior connecting multi-view 2D images to 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. The user only needs to provide a manual segmentation prompt (e.g., rough points) for the target object in a single view, from which SAM generates the object's 2D mask in that view. Next, SA3D alternates between mask inverse rendering and cross-view self-prompting across views to iteratively complete the 3D mask of the target object, which is represented with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto the 3D mask, guided by the density distribution learned by the NeRF; the latter automatically extracts reliable prompts from the NeRF-rendered 2D mask in another view as input to SAM. Experiments show that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology for lifting a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views. The project page is at https://jumpat.github.io/SA3D/.
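The alternation between mask inverse rendering and cross-view self-prompting can be illustrated with a toy sketch. This is not the SA3D implementation: it assumes axis-aligned orthographic rays instead of NeRF volume rendering, a simple max-union update of the voxel mask, an argmax rule for choosing the next prompt, and a hypothetical `toy_segmenter` that stands in for SAM by reading a ground-truth instance image. All function names here are illustrative.

```python
import numpy as np

def make_scene(n=16):
    """Toy scene (assumption): two solid cubes with instance ids 1 and 2.
    The binary density grid stands in for a trained NeRF density field."""
    ids = np.zeros((n, n, n), dtype=int)
    ids[2:6, 2:6, 2:6] = 1       # target object
    ids[9:13, 9:13, 9:13] = 2    # distractor object
    density = (ids > 0).astype(float)
    return ids, density

def toy_segmenter(ids, axis, point):
    """Hypothetical stand-in for SAM: given a point prompt in a view,
    return the 2D mask of the instance visible at that pixel."""
    id_img = ids.max(axis=axis)
    target = id_img[point]
    return (id_img == target) if target > 0 else np.zeros(id_img.shape, bool)

def render_mask(mask3d, density, axis):
    """Render the 3D mask into a view as a density-weighted average per ray."""
    weight = density.sum(axis=axis)
    num = (mask3d * density).sum(axis=axis)
    return np.divide(num, weight, out=np.zeros_like(num), where=weight > 0)

def inverse_render(mask3d, density, mask2d, axis):
    """Lift a 2D mask into the voxel grid: voxels on masked rays are
    marked in proportion to density, so empty space stays unmarked."""
    lifted = np.expand_dims(mask2d.astype(float), axis) * density
    return np.maximum(mask3d, lifted)

def self_prompt(rendered):
    """Pick the most confident pixel of the rendered mask as the next prompt."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(rendered), rendered.shape))

def segment_3d(ids, density, first_axis, first_point, axes=(0, 1, 2)):
    """Alternate inverse rendering and self-prompting over the given views."""
    mask3d = np.zeros_like(density)
    # Step 1: the only manual prompt, in a single view.
    mask2d = toy_segmenter(ids, first_axis, first_point)
    mask3d = inverse_render(mask3d, density, mask2d, first_axis)
    # Step 2: every other view is prompted automatically from the 3D mask.
    for axis in axes:
        if axis == first_axis:
            continue
        point = self_prompt(render_mask(mask3d, density, axis))
        mask2d = toy_segmenter(ids, axis, point)
        mask3d = inverse_render(mask3d, density, mask2d, axis)
    return mask3d > 0.5

ids, density = make_scene()
mask = segment_3d(ids, density, first_axis=0, first_point=(3, 3))
print("voxels in 3D mask:", int(mask.sum()))
```

In this toy setting a single manual point on the target cube is enough: the remaining views are prompted from the rendered 3D mask, and the distractor cube is never selected. A real pipeline would replace `toy_segmenter` with an actual promptable 2D segmenter and the orthographic sums with rendering through the learned radiance field.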