Given an audio clip and a reference face image, talking head generation aims to produce a high-fidelity talking head video. Although previous audio-driven methods have made progress, most focus only on synchronizing the lips with the audio and cannot reproduce the facial expressions of the target person. To this end, we propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an Attention-Augmented Translator based on U-net (AATU). First, MSEF extracts implicit emotional auxiliary features from the audio to estimate more accurate emotional face landmarks. Second, AATU acts as a translator between the estimated landmarks and photo-realistic video frames. Extensive qualitative and quantitative experiments show that the proposed method outperforms previous work. Code will be made publicly available.