This paper shows that models for monocular 3D reconstruction of articulated objects (e.g., horses, cows, sheep) can be learned from as few as 50-150 images labeled with 2D keypoints. Our approach trains category-specific keypoint estimators, generates 2D keypoint pseudo-labels on unlabeled web images, and uses both the labeled and self-labeled sets to train 3D reconstruction models. It rests on two key insights: (1) 2D keypoint estimation networks trained on as few as 50-150 images of a given object category generalize well and generate reliable pseudo-labels; (2) a data selection mechanism can automatically create a "curated" subset of the unlabeled web images for training, and we evaluate four such data selection methods. Coupling these two insights lets us train models that make effective use of web images, improving 3D reconstruction performance over the fully-supervised baseline for several articulated object categories. Because the approach needs only a few images labeled with 2D keypoints, it can quickly bootstrap a model, and this requirement is easy to satisfy for any new object category. To showcase its practicality for predicting the 3D shape of arbitrary object categories, we annotate 2D keypoints on giraffe and bear images from COCO; the annotation process takes less than 1 minute per image.
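The data selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the four selection methods, so the criterion used here (ranking pseudo-labeled images by mean keypoint confidence and keeping the top `keep` images) is only an assumption chosen for illustration, not necessarily one of the evaluated methods. The names `PseudoLabel`, `image_score`, `select_curated_subset`, and `build_training_set` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PseudoLabel:
    """Keypoints for one image (hypothetical container, not from the paper)."""
    image_id: str
    keypoints: List[Tuple[float, float]]  # (x, y) per keypoint
    confidences: List[float]              # per-keypoint score in [0, 1]


def image_score(label: PseudoLabel) -> float:
    """Mean keypoint confidence; an assumed stand-in selection criterion."""
    if not label.confidences:
        return 0.0
    return sum(label.confidences) / len(label.confidences)


def select_curated_subset(
    labels: Sequence[PseudoLabel], keep: int
) -> List[PseudoLabel]:
    """Return the `keep` pseudo-labeled images with the highest score."""
    ranked = sorted(labels, key=image_score, reverse=True)
    return ranked[: max(keep, 0)]


def build_training_set(
    labeled: Sequence[PseudoLabel], pseudo: Sequence[PseudoLabel], keep: int
) -> List[PseudoLabel]:
    """Combine the human-labeled set with the curated self-labeled subset."""
    return list(labeled) + select_curated_subset(pseudo, keep)
```

In this sketch, `labeled` would hold the 50-150 annotated images and `pseudo` the web images labeled by the keypoint estimator; the combined list would then be passed to whatever routine trains the 3D reconstruction model.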