A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines them with depth in a principled manner. 3DNEL is trained entirely on synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state of the art in sim-to-real pose estimation on the YCB-Video dataset and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks, such as object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.