Predominant techniques for talking head generation largely depend on 2D information, including facial appearances and motions from input face images. Dense 3D facial geometry, such as pixel-wise depth, nevertheless plays a critical role in constructing accurate 3D facial structures and suppressing complex background noise during generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this work, firstly, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations in training. We further propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. Secondly, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer, to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that our proposed framework can generate highly realistic-looking reenacted talking videos, with new state-of-the-art performance established on these benchmarks. The code and trained models are publicly available on the GitHub project page at https://github.com/harlanhong/CVPR2022-DaGAN
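The abstract does not spell out how the cross-modal attention is computed. As a rough illustration only, the sketch below assumes a standard scaled dot-product form in which queries come from depth features and keys and values come from appearance features at the same spatial resolution. The function name `cross_modal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical and are not taken from the released code.

```python
import numpy as np


def cross_modal_attention(appearance, depth, w_q, w_k, w_v):
    """Illustrative depth-queried attention over appearance features.

    Assumed (hypothetical) shapes:
      appearance: (C, H, W)    depth: (C_d, H, W)
      w_q: (D, C_d)            w_k: (D, C)          w_v: (C, C)
    The matrices act per pixel, like 1x1 convolutions.
    """
    c, h, w = appearance.shape
    n = h * w
    a = appearance.reshape(c, n)          # (C, N)
    d = depth.reshape(depth.shape[0], n)  # (C_d, N)

    q = w_q @ d  # queries from the depth modality, (D, N)
    k = w_k @ a  # keys from the appearance modality, (D, N)
    v = w_v @ a  # values from the appearance modality, (C, N)

    # Scaled dot-product scores: row i holds query position i against all keys.
    scores = (q.T @ k) / np.sqrt(q.shape[0])     # (N, N)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each output position is a weighted sum of appearance values.
    out = v @ weights.T  # (C, N)
    return out.reshape(c, h, w), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    app = rng.normal(size=(4, 3, 5))  # toy appearance feature map
    dep = rng.normal(size=(2, 3, 5))  # toy depth feature map
    out, wts = cross_modal_attention(
        app, dep,
        rng.normal(size=(6, 2)), rng.normal(size=(6, 4)), np.eye(4),
    )
    print(out.shape, wts.shape)
```

Applying a block like this at several decoder resolutions is one way to read "coarse-to-fine"; the full (N, N) weight matrix is only practical for small feature maps.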