Annotating real images with 6D object poses is expensive and time-consuming. Although large amounts of synthetic RGB images are easy to obtain, models trained on them suffer noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D) data without requiring real pose labels. We first pre-train the model on synthetic RGB images and then fine-tune it on real RGB(-D) images. Fine-tuning is self-supervised by an RGB-based pose-aware consistency and a depth-guided object distance pseudo-label, and does not require time-consuming online differentiable rendering. We build our domain adaptation method on the recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We experimentally demonstrate that our method achieves performance comparable to its fully supervised counterpart while outperforming existing state-of-the-art approaches.
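To make the idea of a depth-guided object distance pseudo-label more concrete, the following is a minimal sketch rather than the procedure used in this work. It assumes that a depth map and a binary object mask are available for a real RGB-D frame and that a robust statistic (here the median) of the valid masked depth values can serve as the distance target. The function name `distance_pseudo_label`, the `min_valid` parameter, and the choice of the median are all hypothetical and introduced only for illustration.

```python
from statistics import median


def distance_pseudo_label(depth, mask, min_valid=1):
    """Estimate an object distance pseudo-label from a depth map.

    Assumption (not taken from the paper): the label is the median of
    the valid depth readings inside the object mask.

    depth: 2D list of depth values in meters; 0 marks a missing reading.
    mask:  2D list of the same shape; truthy entries belong to the object.
    Returns None when fewer than `min_valid` valid pixels are found.
    """
    values = [
        d
        for depth_row, mask_row in zip(depth, mask)
        for d, m in zip(depth_row, mask_row)
        if m and d > 0  # keep object pixels with a valid depth reading
    ]
    if len(values) < min_valid:
        return None
    return median(values)


# Toy 2x3 frame: two object pixels have missing depth (0.0),
# and one background pixel (mask 0) is ignored.
depth = [
    [0.0, 0.82, 0.80],
    [1.50, 0.81, 0.0],
]
mask = [
    [1, 1, 1],
    [0, 1, 1],
]
label = distance_pseudo_label(depth, mask)
```

In a fine-tuning loop, such a value could supervise the predicted object distance on real frames without any pose annotation. The median is chosen here only because it is less sensitive to depth noise at the mask boundary than the mean.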