Generating talking head videos from a face image and a piece of speech audio still poses many challenges, such as unnatural head movement, distorted expressions, and identity modification. We argue that these issues arise mainly from learning from coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from stiff expressions and incoherent video. We present SadTalker, which generates the 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face renderer for talking head generation. To learn realistic motion coefficients, we explicitly model the connections between audio and each type of motion coefficient individually. Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces. For head pose, we design PoseVAE, a conditional VAE, to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of the proposed face renderer to synthesize the final video. We conduct extensive experiments to demonstrate the superiority of our method in terms of motion and video quality.
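The separation of expression and head pose described above can be illustrated with a minimal data-flow sketch. This is not the SadTalker implementation: the two functions are hypothetical placeholders standing in for ExpNet and PoseVAE, and the dimensions `EXP_DIM` and `POSE_DIM`, the `style_id` parameter, and the `MotionCoefficients` container are assumptions made only for illustration. The sketch shows one point only: the two coefficient types are produced by separate models and combined afterwards.

```python
import random
from dataclasses import dataclass
from typing import List

EXP_DIM = 64   # assumed size of the expression coefficient vector
POSE_DIM = 6   # assumed size of the head pose vector (rotation + translation)


@dataclass
class MotionCoefficients:
    """Per-frame 3D motion coefficients (hypothetical container)."""
    expression: List[List[float]]
    head_pose: List[List[float]]


def exp_net_stub(audio_features: List[List[float]]) -> List[List[float]]:
    """Placeholder for ExpNet: one expression vector per audio frame.

    A real model would regress these values from the audio; here they are zeros.
    """
    return [[0.0] * EXP_DIM for _ in audio_features]


def pose_vae_stub(num_frames: int, style_id: int, seed: int = 0) -> List[List[float]]:
    """Placeholder for PoseVAE: sample a latent and 'decode' it per frame.

    The style condition only changes the random seed in this toy version.
    """
    rng = random.Random(seed + style_id)
    latent = [rng.gauss(0.0, 1.0) for _ in range(POSE_DIM)]
    # Toy decoder: a slow linear drift along the sampled latent direction.
    return [[0.01 * t * v for v in latent] for t in range(num_frames)]


def generate_motion(audio_features: List[List[float]], style_id: int = 0) -> MotionCoefficients:
    """Model expression and head pose individually, then combine them."""
    expression = exp_net_stub(audio_features)
    head_pose = pose_vae_stub(len(audio_features), style_id)
    return MotionCoefficients(expression=expression, head_pose=head_pose)
```

In this sketch, calling `generate_motion` with five frames of audio features yields five expression vectors and five pose vectors. Changing `style_id` alters only the pose sequence, which mirrors the idea that head motion style is controlled independently of the audio-driven expression.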