Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and they are challenged by ambiguities inherent in the task: the depth-scale ambiguity of monocular perception and the inexact matches between CAD database models and real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models that capture the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state of the art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
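
The depth-scale ambiguity mentioned above can be illustrated with a minimal sketch. It is not part of DiffCAD; it assumes an ideal pinhole camera with a hypothetical focal length of 500 pixels and a made-up square object, and it only shows that an object and a copy that is three times larger and three times farther away project to identical pixels. The names `project` and `scale_scene` are illustrative.

```python
def project(point, focal=500.0):
    """Pinhole projection of a camera-space point (x, y, z) to pixel offsets (u, v)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def scale_scene(points, s):
    """Scale every point about the camera center, so size and depth grow together."""
    return [(s * x, s * y, s * z) for x, y, z in points]


# Corners of a hypothetical 1 m wide square placed 2 m in front of the camera.
near_small = [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (0.5, 0.5, 2.0), (-0.5, 0.5, 2.0)]
# The same square made 3x larger and moved 3x farther away.
far_large = scale_scene(near_small, 3.0)

pixels_a = [project(p) for p in near_small]
pixels_b = [project(p) for p in far_large]

# Both configurations produce the same image coordinates.
same_image = all(
    abs(ua - ub) < 1e-9 and abs(va - vb) < 1e-9
    for (ua, va), (ub, vb) in zip(pixels_a, pixels_b)
)
print(same_image)
```

Because a single image cannot separate these two configurations, a method that outputs several scale and pose hypotheses can represent both, whereas a single deterministic estimate has to commit to one.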