Decaf: Monocular Deformation Capture for Face and Hand Interactions

Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects. Modelling dense non-rigid object deformations in this setting remained largely unaddressed so far, although such effects can improve the realism of the downstream applications such as AR/VR and avatar communications. This is due to the severe ill-posedness of the monocular view setting and the associated challenges. While it is possible to naively track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates such as depth ambiguity, unnatural intra-object collisions and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges depicted above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible compared to several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf

翻译：现有基于单目RGB视频的三维追踪方法主要关注铰接式关节物体与刚体。在此场景下建模密集非刚体形变至今仍未得到充分解决，尽管此类效果能增强增强现实/虚拟现实及数字人通信等下游应用的真实感。这一困境源于单目视角设置的严重病态性及其相关挑战。虽然可借助三维模板或参数化模型独立追踪多个非刚体物体，但此类方法产生的三维估计会呈现深度模糊、物体间非自然碰撞及形变缺失或失真等多重伪影。为此，本文首次提出解决上述根本性挑战的方法，能够从单目RGB视频中实现对人类手部与人脸交互的三维追踪。我们将手部建模为主动交互中诱发人脸非刚体形变的铰接式物体。该方法依赖全新的人脸-手部运动与交互捕捉数据集，该数据集通过无标记多视角相机系统采集真实面部形变。在数据创建的关键环节，我们采用基于位置的动力学处理方法以及头部组织非均匀刚度估计技术处理重建的原始三维形状，从而获得表面形变、手脸接触区域及头手位置等合理标注。我们神经方法的核心包含提供手脸深度先验的变分自编码器，以及通过估计接触与形变引导三维追踪的模块。相较于多个适用基线方法，最终的三维手部与人脸重建在定量与定性评估中均展现出更优的现实性与合理性。https://vcai.mpi-inf.mpg.de/projects/Decaf