We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the need for 3D hand annotations. We therefore propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model that jointly estimates pose, shape, texture, and camera viewpoint from a single RGB input, supervised by easily obtained 2D detected keypoints. We further leverage the continuous hand motion information contained in unlabeled video data and propose ${\rm {S}^{2}HAND(V)}$, which applies a set of weight-sharing ${\rm {S}^{2}HAND}$ models to each frame and exploits additional motion, texture, and shape consistency constraints to promote more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach achieves hand reconstruction performance comparable to recent fully supervised methods in the single-frame input setting, and notably improves reconstruction accuracy and consistency when trained on video data.
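As an illustration of the two kinds of supervision mentioned above, the following is a minimal sketch of a 2D keypoint reprojection loss and a cross-frame shape consistency term. The abstract does not specify the exact loss formulations, so everything here is an assumption made for illustration: the weak-perspective camera, the confidence weighting, the mean-deviation penalty, and the function names `project_weak_perspective`, `keypoint_loss`, and `shape_consistency_loss` are all hypothetical.

```python
import numpy as np


def project_weak_perspective(joints_3d, scale, trans_2d):
    """Project (J, 3) joints to (J, 2) image points.

    Assumed camera model (not taken from the paper): scale * xy + translation.
    """
    return scale * joints_3d[:, :2] + trans_2d


def keypoint_loss(joints_3d, keypoints_2d, confidence, scale, trans_2d):
    """Confidence-weighted mean squared error between projected 3D joints
    and detected 2D keypoints (an assumed form of 2D supervision)."""
    projected = project_weak_perspective(joints_3d, scale, trans_2d)
    sq_err = np.sum((projected - keypoints_2d) ** 2, axis=1)
    # Guard against a zero confidence sum to avoid division by zero.
    return float(np.sum(confidence * sq_err) / max(np.sum(confidence), 1e-8))


def shape_consistency_loss(shape_params):
    """Mean squared deviation of per-frame shape vectors (T, D) from their
    temporal mean (an assumed form of a shape consistency constraint)."""
    mean_shape = shape_params.mean(axis=0, keepdims=True)
    return float(np.mean((shape_params - mean_shape) ** 2))
```

In this sketch, low-confidence detections contribute less to the keypoint term, and the consistency term is zero exactly when every frame of a clip predicts the same shape vector.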