Decaf: Monocular Deformation Capture for Face and Hand Interactions

Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects. Modelling dense non-rigid object deformations in this setting has remained largely unaddressed, although such effects can improve the realism of downstream applications such as AR/VR and avatar communications. This is due to the severe ill-posedness of the monocular view setting and the associated challenges. While it is possible to naively track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural inter-object collisions, and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges described above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible than those of several baselines applicable in our setting, both quantitatively and qualitatively. Project page: https://vcai.mpi-inf.mpg.de/projects/Decaf
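To make the dataset-processing step more concrete, the following is a minimal sketch of the textbook position-based dynamics (PBD) distance-constraint projection with a per-edge stiffness value. It is not the authors' pipeline: the function name `project_distance_constraints`, its parameters, and the assumption that stiffness is given per edge in [0, 1] are all hypothetical choices for illustration, and the abstract does not specify how the non-uniform stiffness of the head tissues is estimated or applied.

```python
import math


def project_distance_constraints(positions, inv_masses, edges,
                                 rest_lengths, stiffness, iterations=10):
    """Iteratively project PBD distance constraints (illustrative sketch).

    positions    : list of (x, y, z) particle positions
    inv_masses   : list of inverse masses (0.0 pins a particle in place)
    edges        : list of (i, j) index pairs connected by a constraint
    rest_lengths : rest length of each edge
    stiffness    : per-edge stiffness in [0, 1] (assumed to be given)
    """
    pos = [list(p) for p in positions]
    for _ in range(iterations):
        for (i, j), rest, k in zip(edges, rest_lengths, stiffness):
            w_sum = inv_masses[i] + inv_masses[j]
            if w_sum == 0.0:
                continue  # both particles are pinned
            delta = [a - b for a, b in zip(pos[i], pos[j])]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist < 1e-12:
                continue  # direction undefined for coincident particles
            # Constraint C = dist - rest, split by inverse mass and
            # scaled by the edge stiffness.
            scale = k * (dist - rest) / (dist * w_sum)
            for axis in range(3):
                pos[i][axis] -= inv_masses[i] * scale * delta[axis]
                pos[j][axis] += inv_masses[j] * scale * delta[axis]
    return pos


# A stretched edge with one pinned endpoint: a stiff edge snaps back to its
# rest length in one iteration, a softer edge only recovers part of the way.
stiff = project_distance_constraints(
    [(0, 0, 0), (2, 0, 0)], [0.0, 1.0], [(0, 1)], [1.0], [1.0], iterations=1)
soft = project_distance_constraints(
    [(0, 0, 0), (2, 0, 0)], [0.0, 1.0], [(0, 1)], [1.0], [0.5], iterations=1)
```

In this sketch a lower stiffness lets an edge remain deformed for longer, which is one simple way to express that some regions of a surface are softer than others.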
