Audio-driven talking face generation aims to synthesize video with lip movements synchronized to the input audio. However, current generative techniques struggle to preserve intricate regional textures (skin, teeth). To address this challenge, we propose SegTalker, a novel framework that decouples lip movements from image textures by introducing segmentation as an intermediate representation. Specifically, given the segmentation mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. We then disentangle the semantic regions of the image into style codes using a mask-guided encoder. Finally, we inject the previously generated talking segmentation and the style codes into a mask-guided StyleGAN to synthesize each video frame. In this way, most of the textures are fully preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided local facial editing. In particular, by editing the mask and swapping region textures from a given reference image (e.g., hair, lips, eyebrows), our approach enables seamless facial editing while generating talking face video. Experiments demonstrate that our method effectively preserves texture details and generates temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
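The sketch below illustrates the three-stage pipeline described above in PyTorch. It is a minimal, hedged illustration only: the module names (`AudioToMask`, `MaskGuidedEncoder`, `MaskGuidedGenerator`), the number of parsing regions, the style-code width, and all layer choices are assumptions for exposition, not the authors' implementation; a real system would use a proper face parser and a StyleGAN-based decoder in place of the toy heads here.

```python
# Hypothetical sketch of the SegTalker pipeline: parse -> drive mask with
# audio -> extract per-region style codes -> mask-guided synthesis.
import torch
import torch.nn as nn

N_REGIONS = 19   # assumption: number of face-parsing classes
STYLE_DIM = 512  # assumption: width of each per-region style code

class AudioToMask(nn.Module):
    """Stage 1 (assumed): speech features drive the parsed mask to
    produce a 'talking segmentation' with a synchronized lip region."""
    def __init__(self, audio_dim=80):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 1)  # toy audio conditioning
        self.net = nn.Sequential(
            nn.Conv2d(N_REGIONS + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, N_REGIONS, 3, padding=1),
        )

    def forward(self, mask, audio):
        # mask: (B, N_REGIONS, H, W) soft parsing; audio: (B, audio_dim)
        a = self.audio_proj(audio)[..., None, None]       # (B, 1, 1, 1)
        a = a.expand(-1, 1, mask.size(2), mask.size(3))   # broadcast map
        return self.net(torch.cat([mask, a], dim=1)).softmax(dim=1)

class MaskGuidedEncoder(nn.Module):
    """Stage 2 (assumed): pools image features inside each semantic
    region into one style code per region, disentangling textures."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, STYLE_DIM, 3, padding=1)

    def forward(self, image, mask):
        feat = self.backbone(image)                       # (B, C, H, W)
        # region-wise average pooling -> (B, N_REGIONS, C)
        weights = mask / (mask.sum(dim=(2, 3), keepdim=True) + 1e-6)
        return torch.einsum('bchw,bnhw->bnc', feat, weights)

class MaskGuidedGenerator(nn.Module):
    """Stage 3 (assumed): injects the talking segmentation and style
    codes into a decoder; stands in for the mask-guided StyleGAN."""
    def __init__(self):
        super().__init__()
        self.to_rgb = nn.Conv2d(STYLE_DIM, 3, 3, padding=1)

    def forward(self, talking_mask, styles):
        # paint each region with its style code, then decode to RGB
        canvas = torch.einsum('bnhw,bnc->bchw', talking_mask, styles)
        return torch.tanh(self.to_rgb(canvas))

if __name__ == "__main__":
    B, H, W = 2, 64, 64
    image = torch.randn(B, 3, H, W)
    mask = torch.softmax(torch.randn(B, N_REGIONS, H, W), dim=1)
    audio = torch.randn(B, 80)
    talking_mask = AudioToMask()(mask, audio)
    styles = MaskGuidedEncoder()(image, mask)
    frame = MaskGuidedGenerator()(talking_mask, styles)
    print(frame.shape)  # torch.Size([2, 3, 64, 64])
```

Note how the texture-preservation and editing claims follow from this structure: because styles are indexed per region, swapping one row of `styles` (e.g., the hair region) with a code extracted from a reference image edits only that region, and the background region's code passes through synthesis untouched.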