Reconstructing the 3D shape and pose of static objects from a single image is an essential task for various industries, including robotics, augmented reality, and digital content creation. This can be done by directly predicting 3D shape in various representations or by retrieving CAD models from a database and predicting their alignments. Directly predicting 3D shapes often produces unrealistic, overly smoothed or tessellated shapes. Retrieving CAD models ensures realistic shapes but requires robust and accurate alignment. Learning to predict CAD model poses directly from image features is challenging and yields inaccurate results. Works such as ROCA compute poses from predicted normalised object coordinates, which can be more accurate but is susceptible to systematic failure. SPARC demonstrates that a "render-and-compare" approach, in which a network iteratively improves upon its own predictions, achieves accurate alignments. However, it aligns a CAD model individually for every object detected in an image. This is slow when applied to many objects, as the runtime increases linearly with the number of objects, and it cannot learn inter-object relations. We introduce Multi-SPARC, a new network architecture that learns to perform CAD model alignments for multiple detected objects jointly. Compared to other single-view methods, we achieve state-of-the-art performance on the challenging real-world dataset ScanNet. By improving the instance alignment accuracy from 31.8% to 40.3%, we perform on par with state-of-the-art multi-view methods.