Researchers have shown a growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker from audio. We use a U-Net architecture with residual CBAM blocks to better encode and fuse audio and visual modal information. In addition, a semantic alignment module extends the receptive field of the generator network to capture the spatial and channel information of the visual features efficiently, and matches the statistics of the visual features with the audio latent vector to adjust and inject the audio content information into the visual features. To achieve exact lip synchronization and generate realistic, high-quality images, our approach adopts the LPIPS loss, which simulates human judgment of image quality and reduces instability during training. Subjective and objective evaluation results demonstrate that the proposed method achieves outstanding performance in both lip synchronization accuracy and visual quality. The code for the paper is available at the following link: https://github.com/FelixChan9527/LPIPS-AttnWav2Lip
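The residual CBAM blocks mentioned above combine channel attention and spatial attention applied in sequence. As a rough illustration only (not the paper's implementation: the MLP weights here are random stand-ins and the learned 7×7 convolution in spatial attention is replaced by a fixed average of the pooled maps), a minimal NumPy sketch of the CBAM attention pattern might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, reduction=4, rng=None):
    # x: feature map of shape (C, H, W)
    C = x.shape[0]
    rng = rng or np.random.default_rng(0)
    # shared two-layer MLP applied to avg- and max-pooled channel descriptors
    # (random weights here; in practice these are learned)
    w1 = rng.standard_normal((C // reduction, C)) * 0.1
    w2 = rng.standard_normal((C, C // reduction)) * 0.1
    avg = x.mean(axis=(1, 2))          # (C,) global average pooling
    mx = x.max(axis=(1, 2))            # (C,) global max pooling
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return x * att[:, None, None]      # rescale each channel

def spatial_attention(x):
    # channel-wise avg and max maps; a learned 7x7 conv would normally
    # fuse them -- a fixed average is used here as a stand-in
    avg = x.mean(axis=0, keepdims=True)  # (1, H, W)
    mx = x.max(axis=0, keepdims=True)    # (1, H, W)
    att = sigmoid(0.5 * (avg + mx))
    return x * att                       # rescale each spatial location

def cbam(x):
    # CBAM order: channel attention first, then spatial attention
    return spatial_attention(channel_attention(x))

feat = np.random.default_rng(1).standard_normal((8, 16, 16))
out = cbam(feat)
print(out.shape)  # (8, 16, 16): attention reweights but preserves shape
```

In the paper's generator these blocks sit inside residual connections of the U-Net, so the attention modulates features without blocking gradient flow.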