Talking face generation has a wide range of potential applications in the field of virtual digital humans. However, rendering high-fidelity facial video while ensuring lip synchronization remains a challenge for existing audio-driven talking face generation approaches. To address this issue, we propose HyperLips, a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces. In the first stage, we construct a base face generation network that uses the hypernetwork to control the latent code encoding the visual face information according to the audio. First, FaceEncoder obtains the latent code by extracting features from the visual face information taken from the source video frames that contain the face. Then, HyperConv, whose weight parameters are updated by HyperNet with the audio features as input, modifies the latent code to synchronize the lip movement with the audio. Finally, FaceDecoder decodes the modified and synchronized latent code into visual face content. In the second stage, we obtain higher-quality face videos through a high-resolution decoder. To further improve generation quality, we train this decoder, HRDecoder, using the face images and detected sketches generated in the first stage as input. Extensive quantitative and qualitative experiments show that our method outperforms state-of-the-art work, producing more realistic, higher-fidelity faces with better lip synchronization. Project page: https://semchan.github.io/HyperLips/
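The first-stage idea, a convolution whose weights are produced by a hypernetwork from audio features, can be illustrated with a minimal sketch. This is not the HyperLips implementation: it assumes a purely linear hypernetwork and a 1x1 convolution, uses plain Python lists instead of tensors, and the names `make_hypernet` and `hyper_conv_1x1` are hypothetical stand-ins for HyperNet and HyperConv.

```python
import random


def make_hypernet(audio_dim, in_ch, out_ch, seed=0):
    """Hypothetical linear hypernetwork (an assumption, not the paper's HyperNet).

    Returns a function that maps an audio feature vector to the kernel of a
    1x1 convolution, shaped as kernel[out_ch][in_ch].
    """
    rng = random.Random(seed)
    n_weights = out_ch * in_ch
    # One row of projection coefficients per generated convolution weight.
    proj = [[rng.uniform(-0.1, 0.1) for _ in range(audio_dim)]
            for _ in range(n_weights)]

    def hypernet(audio_feat):
        flat = [sum(p * a for p, a in zip(row, audio_feat)) for row in proj]
        # Reshape the flat weight vector into an (out_ch, in_ch) kernel.
        return [flat[o * in_ch:(o + 1) * in_ch] for o in range(out_ch)]

    return hypernet


def hyper_conv_1x1(latent, kernel):
    """Apply a 1x1 convolution to latent[c][y][x] using kernel[o][c]."""
    in_ch = len(latent)
    height, width = len(latent[0]), len(latent[0][0])
    return [[[sum(kernel[o][c] * latent[c][y][x] for c in range(in_ch))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(kernel))]


# Toy latent code: 3 channels, each a 2x2 feature map.
latent = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(3)]
hypernet = make_hypernet(audio_dim=4, in_ch=3, out_ch=3)

# Two different audio feature vectors yield two different kernels, so the
# same visual latent code is modified differently for each audio input.
out_a = hyper_conv_1x1(latent, hypernet([1.0, 0.0, 0.0, 0.0]))
out_b = hyper_conv_1x1(latent, hypernet([0.0, 1.0, 0.0, 0.0]))
```

The point of the sketch is that the convolution has no fixed weights of its own: the audio features determine the kernel, and the kernel in turn reshapes the visual latent code. In a real system the encoder, hypernetwork, and decoder would be trained jointly, which this toy example does not attempt.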