Prior work on reconstructing hand-held objects from a single image relies on direct 3D shape supervision, which is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of in-the-wild raw video data showing hand-object interactions. In this paper, we automatically extract 3D supervision (via multiview 2D supervision) from such raw video data to scale up the learning of models for hand-held object reconstruction. This requires tackling two key challenges: unknown camera pose and occlusion. For the former, we use hand pose (predicted by existing techniques, e.g., FrankMocap) as a proxy for object pose. For the latter, we learn data-driven 3D shape priors using synthetic objects from the ObMan dataset. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments on the MOW and HO3D datasets show the effectiveness of these supervisory signals for predicting the 3D shape of real-world hand-held objects without any direct real-world 3D supervision.
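To make the idea of multiview 2D supervision more concrete, the following is a minimal sketch, not the paper's actual training objective. It assumes that each video frame provides a hand-to-camera rotation `R` and translation `t` (standing in for the hand pose predicted by an off-the-shelf estimator), a binary object mask, and shared pinhole intrinsics `K`. Points defined in the hand frame are projected into every frame; any point that lands outside the object mask in at least one view is treated as free space, and predicted occupancy there is penalized. The function names `project`, `carving_labels`, and `carving_loss` are hypothetical, and treating points that fall outside the image as free space is a simplifying assumption.

```python
import numpy as np

def project(points_hand, R, t, K):
    """Map points from the hand frame to pixel coordinates in one view."""
    cam = points_hand @ R.T + t                    # hand frame -> camera frame
    depth = cam[:, 2]
    safe = np.where(depth > 1e-8, depth, 1.0)      # avoid division by zero
    uv = (cam @ K.T)[:, :2] / safe[:, None]        # pinhole projection
    return uv, depth

def carving_labels(points_hand, views, K):
    """True for points that project inside the object mask in every view."""
    inside_all = np.ones(len(points_hand), dtype=bool)
    for R, t, mask in views:
        uv, depth = project(points_hand, R, t, K)
        h, w = mask.shape
        col = np.round(uv[:, 0]).astype(int)
        row = np.round(uv[:, 1]).astype(int)
        visible = (depth > 1e-8) & (col >= 0) & (col < w) & (row >= 0) & (row < h)
        hit = np.zeros(len(points_hand), dtype=bool)
        # Assumption: points behind the camera or off-image count as outside.
        hit[visible] = mask[row[visible], col[visible]]
        inside_all &= hit
    return inside_all

def carving_loss(occ_pred, inside_all, eps=1e-7):
    """Penalize predicted occupancy at points that some view marks as empty."""
    outside = ~inside_all
    if not outside.any():
        return 0.0
    occ = np.clip(occ_pred[outside], 0.0, 1.0 - eps)
    return float(-np.log(1.0 - occ).mean())
```

Note that this constraint is one-sided: a point inside every mask may still be empty, and occluded regions receive no signal at all, which is where the learned shape prior described in the abstract would come in.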