Speech-to-face generation aims to synthesize a realistic facial image of a speaker from their speech audio. However, state-of-the-art methods built on GAN-based architectures lack stability and fail to generate realistic face images. To fill this gap, we propose a novel speech-to-face generation framework built on a Speech-Conditioned Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the first work to harness the modeling capabilities of diffusion models for speech-to-face generation. Preserving the identity information shared between speech and face is crucial for generating realistic results. Therefore, we apply contrastive pre-training to both the speech encoder and the face encoder. This pre-training strategy aligns speech attributes, such as age and gender, with the corresponding facial characteristics in the face images. Furthermore, we tackle the excessive diversity that the diffusion model introduces into the synthesis process. To overcome this challenge, we introduce the concept of residuals by integrating a statistical face prior into the diffusion process. This removes the component shared across faces and enhances the subtle variations captured by the speech condition. Extensive quantitative, qualitative, and user-study experiments demonstrate that our method produces more realistic face images and preserves the speaker's identity better than state-of-the-art methods. Our method achieves significant gains in all metrics on the AVSpeech and VoxCeleb datasets; most notably, it improves the cosine distance metric by 32.17 and 32.72 on the two datasets, respectively.
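The abstract names two mechanisms without giving implementation details: contrastive pre-training of the speech and face encoders, and a residual formulation around a statistical face prior. The following is a minimal sketch of how these ideas could look, not the authors' implementation. It assumes a CLIP-style symmetric InfoNCE loss for the contrastive step and takes the per-dimension mean of the face latents as the statistical prior; both choices, along with every function name and the temperature value, are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(speech_emb, face_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of each batch is a matching pair.

    The temperature of 0.07 is an illustrative value, not from the paper.
    """
    s = l2_normalize(speech_emb)
    f = l2_normalize(face_emb)
    logits = s @ f.T / temperature  # pairwise cosine similarities, scaled
    idx = np.arange(len(s))
    # Speech-to-face: each speech clip should pick out its own face.
    loss_s2f = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Face-to-speech: each face should pick out its own speech clip.
    loss_f2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2f + loss_f2s)


def face_prior(face_latents):
    """Assumed statistical prior: the mean face latent over a dataset."""
    return face_latents.mean(axis=0, keepdims=True)


def to_residual(face_latents, prior):
    """Remove the component shared by all faces; a model would fit this."""
    return face_latents - prior


def from_residual(residual, prior):
    """Add the prior back to a predicted residual to recover a latent."""
    return residual + prior
```

Under these assumptions, a diffusion model would be trained on the output of `to_residual` rather than on the raw latents, and `from_residual` would be applied to each sample before decoding, so that the speech condition only has to account for how a face departs from the average.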