The lifting of 3D structure and camera from 2D landmarks is a cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g., C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data, significantly restricting their utility to applications where an abundance of "in-correspondence" 3D data is available. Our approach harnesses the inherent permutation equivariance of transformers to manage varying numbers of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.
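To make the permutation-equivariance claim concrete, the sketch below shows one way a transformer can lift a variable number of 2D landmarks per instance: each point is embedded independently, no positional encoding is added (so reordering the inputs reorders the outputs identically), and a key-padding mask lets instances with different point counts, including occluded or missing points, share one padded batch. This is a minimal illustration under our own assumptions, not the 3D-LFM implementation; the module name `Lifting2Dto3D` and all layer sizes are hypothetical.

```python
# Minimal sketch (assumptions, not the authors' code) of permutation-
# equivariant 2D-to-3D lifting with a variable number of points.
import torch
import torch.nn as nn

class Lifting2Dto3D(nn.Module):  # hypothetical module name
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Per-point embedding; omitting positional encodings keeps the
        # encoder equivariant to the ordering of the input points.
        self.embed = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 3)  # per-point 3D coordinates

    def forward(self, pts2d, pad_mask):
        # pts2d: (B, N, 2) zero-padded 2D landmarks.
        # pad_mask: (B, N) bool, True where a slot is padding (or an
        # occluded point masked out of attention).
        h = self.encoder(self.embed(pts2d), src_key_padding_mask=pad_mask)
        return self.head(h)  # (B, N, 3)

# Two instances with different landmark counts share one padded batch.
pts = torch.zeros(2, 10, 2)
pts[0, :8] = torch.rand(8, 2)   # instance with 8 landmarks
pts[1, :5] = torch.rand(5, 2)   # instance with 5 landmarks
mask = torch.tensor([[False] * 8 + [True] * 2,
                     [False] * 5 + [True] * 5])
xyz = Lifting2Dto3D()(pts, mask)  # (2, 10, 3)
```

Because attention treats the input as a set, no fixed keypoint vocabulary or cross-instance correspondence is baked into the architecture, which is what allows training across heterogeneous object categories.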