Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and they face challenges from two ambiguities inherent in the task: depth-scale ambiguity in monocular perception, and inexact matches between CAD database models and real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models that capture the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
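
The depth-scale ambiguity named above can be illustrated with a minimal sketch, assuming an ideal pinhole camera. The `project` helper, the focal length, and the cube are hypothetical and chosen only for illustration; none of them is part of DiffCAD. The sketch shows that an object scaled by a factor s and moved s times farther from the camera yields the same image, which is why a single RGB view cannot determine scale and depth separately.

```python
import numpy as np


def project(points, focal=500.0):
    """Ideal pinhole projection of Nx3 camera-frame points to Nx2 image offsets.

    The focal length (in pixels) is an arbitrary illustrative value.
    """
    points = np.asarray(points, dtype=float)
    return focal * points[:, :2] / points[:, 2:3]


# Hypothetical object: the 8 corners of a 1 m cube centered 3 m in front of the camera.
cube = np.array(
    [[x, y, z] for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (2.5, 3.5)]
)

# Scaling every camera-frame coordinate by s makes the object s times larger
# and places it s times farther away; the projected image is unchanged.
scales = (0.5, 1.0, 2.0)
images = [project(s * cube) for s in scales]
same_image = all(np.allclose(img, images[0]) for img in images)
print("identical projections across scales:", same_image)
```

Because every scaled copy projects to the same pixels, each one is an equally valid explanation of the image, which is the kind of ambiguity a multi-hypothesis output is meant to represent.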