Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior

Speech-to-face generation aims to synthesize a realistic facial image of a speaker from that speaker's speech audio. However, state-of-the-art methods built on GAN-based architectures lack stability and cannot generate realistic face images. To fill this gap, we propose a novel speech-to-face generation framework based on a Speech-Conditioned Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the first work to harness the modeling capabilities of diffusion models for speech-to-face generation. Preserving the identity information shared between speech and face is crucial for generating realistic results, so we apply contrastive pre-training to both the speech encoder and the face encoder. This pre-training strategy aligns attributes that can be inferred from speech, such as age and gender, with the corresponding facial characteristics in the face images. We also address the excessive diversity that the diffusion model introduces into the synthesis process. To this end, we introduce the concept of residuals by integrating a statistical face prior into the diffusion process. This removes the component shared across faces and enhances the subtle variations captured by the speech condition. Extensive quantitative, qualitative, and user-study experiments demonstrate that our method produces more realistic face images while preserving the identity of the speaker better than state-of-the-art methods. Our method achieves significant gains on all metrics on the AVSpeech and Voxceleb datasets; in particular, it improves the cosine distance metric by 32.17 and 32.72 on the two datasets, respectively.
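The abstract does not spell out the contrastive objective or how the face prior enters the diffusion process, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes a symmetric InfoNCE-style loss over paired speech and face embeddings, and it assumes the statistical face prior is a mean face latent that is subtracted before diffusion and added back afterwards. The function names, the temperature value, and the array shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(speech_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form of the contrastive objective).

    Row i of speech_emb and row i of face_emb are taken to come from the
    same speaker; every other pairing in the batch acts as a negative.
    """
    s = l2_normalize(speech_emb)
    f = l2_normalize(face_emb)
    logits = s @ f.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    loss_s2f = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> face
    loss_f2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # face -> speech
    return 0.5 * (loss_s2f + loss_f2s)


def to_residual(face_latents, mean_latent):
    """Remove the component shared across faces (assumed to be a mean latent)."""
    return face_latents - mean_latent


def from_residual(residuals, mean_latent):
    """Add the shared component back to recover full face latents."""
    return residuals + mean_latent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))       # stand-in speech embeddings
    face = rng.normal(size=(8, 16))         # stand-in face embeddings
    print("contrastive loss:", contrastive_loss(speech, face))

    latents = rng.normal(size=(8, 4, 4))    # stand-in face latents
    mean_latent = latents.mean(axis=0)      # stand-in statistical face prior
    residuals = to_residual(latents, mean_latent)
    print("max abs residual mean:", np.abs(residuals.mean(axis=0)).max())
```

In this sketch the diffusion model would be trained on `residuals` rather than on the full latents, which is one plausible reading of how subtracting a shared component can leave the speech condition responsible only for speaker-specific variation.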

