This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference, and synthesizes a photo-realistic talking-face video with head poses controlled by a short video clip and with proper eye blinking embedded. Both head pose and eye blinking are important cues for deepfake detection. Implicit control of poses by video has already been achieved by state-of-the-art work. According to recent research, eye blinking has a weak correlation with the input audio, which means that extracting eye blinks from audio and generating them is possible. Hence, we propose a GAN-based architecture that extracts eye blink features from the input audio and from the reference video respectively, employs contrastive training between them, and then embeds the blink feature into the concatenated identity and pose features to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking faces with synchronous lip motions, natural head poses, and blinking eyes.
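The contrastive training between audio-derived and video-derived blink features described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so the InfoNCE-style form, the cosine similarity, the `temperature` value, and the helper names `cosine`, `contrastive_loss`, and `embed_blink` are all assumptions made for illustration. Treating "embedding" as plain concatenation is likewise an assumption, not the paper's confirmed design.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(audio_feats, video_feats, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    audio_feats[i] and video_feats[i] come from the same clip and are
    pulled together; all other pairings in the batch act as negatives.
    """
    total = 0.0
    for i, a in enumerate(audio_feats):
        logits = [cosine(a, v) / temperature for v in video_feats]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio_feats)


def embed_blink(identity, pose, blink):
    """Append the blink feature to the concatenated identity and pose
    features (hypothetical: embedding modeled as simple concatenation)."""
    return list(identity) + list(pose) + list(blink)


# Toy batch of two clips with 2-D blink features.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # audio features matching their clips
swapped = [[0.1, 0.9], [0.9, 0.1]]   # audio features matching the wrong clips

print(contrastive_loss(aligned, video) < contrastive_loss(swapped, video))
print(embed_blink([0.2, 0.4], [0.1], [0.9, 0.1]))
```

In this toy batch the loss is lower when each audio blink feature is closest to the video blink feature of the same clip, which is the behavior contrastive training is meant to encourage. The fused vector would then be passed to the generator.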