3D scene understanding for robotic applications exhibits a unique set of requirements, including real-time inference, object-centric latent representation learning, accurate 6D pose estimation, and 3D reconstruction of objects. Current methods for scene understanding typically rely on a combination of trained models paired with either an explicit or learned volumetric representation, each of which has its own drawbacks and limitations. We introduce DreamUp3D, a novel Object-Centric Generative Model (OCGM) designed explicitly to perform inference on a 3D scene informed only by a single RGB-D image. DreamUp3D is a self-supervised model, trained end-to-end, that is capable of segmenting objects, providing 3D object reconstructions, generating object-centric latent representations, and producing accurate per-object 6D pose estimates. We compare DreamUp3D to baselines including NeRFs, pre-trained CLIP features, ObSurf, and ObPose on a range of tasks, including 3D scene reconstruction, object matching, and object pose estimation. Our experiments show that our model outperforms all baselines by a significant margin in real-world scenarios, demonstrating its applicability to 3D scene understanding tasks while meeting the strict requirements of robotics applications.