Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training data is a viable alternative, the domain shift between real and synthetic data remains a significant obstacle. In this work, we aim to narrow the performance gap between models trained on synthetic data plus a few real images and fully supervised models trained on large-scale real data. We approach the problem from two perspectives: 1) We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models and enhanced with a novel algorithm. 2) We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering. In particular, we exploit the spatial relationships between features on the mesh surface and a contrastive learning scheme to guide the domain adaptation process. Combined, these two contributions enable our models to perform competitively with state-of-the-art models using only 10% of the respective real training images, and to outperform the state-of-the-art model by 10.4% at a threshold of π/18 using only 50% of the real training data. Our trained model further demonstrates robust generalization to out-of-distribution scenarios despite being trained with minimal real data.
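The abstract does not spell out how the π/18 threshold is applied. A minimal sketch, assuming the convention common in 3D pose estimation benchmarks: a prediction counts as correct when the geodesic angle between the predicted and ground-truth rotation matrices is below the threshold. The function names `rotation_error`, `accuracy_at_threshold`, and `rot_z`, as well as the strict `<` comparison, are illustrative assumptions and are not taken from the paper.

```python
import math


def rotation_error(R_pred, R_gt):
    """Geodesic angle in radians between two 3x3 rotation matrices.

    Uses angle = arccos((trace(R_pred^T R_gt) - 1) / 2). The trace of
    R_pred^T R_gt equals the sum of elementwise products of the two matrices.
    """
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def accuracy_at_threshold(preds, gts, threshold=math.pi / 18):
    """Fraction of predictions whose rotation error is below `threshold`.

    Assumption: strict inequality; some evaluations may use <= instead.
    """
    if len(preds) == 0 or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(preds, gts) if rotation_error(p, g) < threshold)
    return hits / len(preds)


def rot_z(theta):
    """Helper for the usage example: rotation by `theta` about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    ground_truth = [rot_z(0.0), rot_z(0.0)]
    predictions = [rot_z(0.05), rot_z(0.5)]  # 0.05 rad is within pi/18, 0.5 rad is not
    print(accuracy_at_threshold(predictions, ground_truth))
```

Under this convention, π/18 radians corresponds to 10 degrees, so the metric rewards only fine-grained pose predictions.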