We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3D morphable model (3DMM) with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve the synthesis of out-of-model expressions, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by the 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a given query point. We further show that using a convolutional neural network in the UV space is critical for incorporating spatial context and producing representative local features. Extensive experiments show that our method reconstructs high-quality avatars with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
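
As a rough illustration of how features anchored on the 3DMM geometry might be interpolated at a query point, the sketch below blends the features of the nearest mesh vertices. The k-nearest-neighbour selection, the inverse-distance weights, and all names (`interpolate_local_features`, `k`, `eps`) are assumptions made for illustration; the abstract does not specify the interpolation scheme. The UV-space network that would predict the features and the radiance decoder are omitted.

```python
import numpy as np

def interpolate_local_features(query, vertices, features, k=4, eps=1e-8):
    """Blend vertex-anchored features at a 3D query point.

    query:    (3,) query point in 3D space
    vertices: (V, 3) positions of the (deformed) mesh vertices
    features: (V, C) one feature vector per vertex
    Assumed scheme: inverse-distance weighting over the k nearest vertices.
    """
    # Distance from the query point to every vertex.
    dists = np.linalg.norm(vertices - query, axis=1)
    # Indices of the k closest vertices (order among them is irrelevant).
    nearest = np.argpartition(dists, k - 1)[:k]
    # Inverse-distance weights, normalised to sum to one.
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    # Weighted sum of the selected feature vectors, shape (C,).
    return weights @ features[nearest]

# Placeholder data: random vertices and features stand in for a tracked mesh.
rng = np.random.default_rng(0)
vertices = rng.normal(size=(100, 3))
features = rng.normal(size=(100, 16))
query = np.array([0.1, -0.2, 0.3])
feat = interpolate_local_features(query, vertices, features)

# Because the features are attached to vertices, moving the mesh and the
# query point by the same rigid offset leaves the interpolated feature unchanged.
offset = np.array([0.5, 0.0, -0.5])
feat_moved = interpolate_local_features(query + offset, vertices + offset, features)
```

In a full pipeline of this kind, the interpolated feature would then be passed to a decoder that outputs density and colour for volume rendering; that stage is left out here.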