The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the observation that people can imagine someone's voice when they look at that person's face, we introduce Face-TTS, a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and ground-truth speech segments in the speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech for unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.
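The abstract does not give the form of the speaker feature binding loss, so the following is only a minimal sketch of the general idea: penalise the distance between the speaker embedding of a generated segment and that of the ground-truth segment. The cosine-distance formulation and the names `cosine_similarity` and `speaker_binding_loss` are assumptions made for illustration, not the authors' definition, and the embeddings are plain lists standing in for the output of a speaker encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_binding_loss(
    generated_emb: Sequence[float], reference_emb: Sequence[float]
) -> float:
    """Hypothetical binding loss: cosine distance in speaker embedding space.

    Returns 0 when the two embeddings point in the same direction and
    grows (up to 2) as they diverge. This is an illustrative stand-in,
    not the loss defined in the paper.
    """
    return 1.0 - cosine_similarity(generated_emb, reference_emb)
```

In a training loop, a term of this kind would be added to the TTS objective so that the generated speech is pulled towards the identity of the ground-truth speaker; how it is weighted against the other losses is not stated in the abstract.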