Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps -- 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the one-to-one pixel-point correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer using cross-attention with 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn consistent masks across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving performance comparable to per-scene optimization baselines on the NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.