We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.