We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging the inconsistent 2D masks predicted by zero-shot, class-agnostic segmentation models. Unlike prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning, Gaga exploits spatial information to associate object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes across training images, Gaga is robust to variations in camera pose and maintains consistent mask labels, which is particularly beneficial for sparsely sampled images. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and performs robustly with different open-world, zero-shot, class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, highlighting its potential for real-world applications such as 3D scene understanding and manipulation.