In this paper, we propose a multi-speaker face-to-speech waveform generation model that also generalizes to unseen speakers. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method converts face images directly into speech waveforms within an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary with the input face images. Objective and subjective evaluations show that the proposed model outperforms conventional methods. Specifically, we evaluate the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker similarity and gender similarity for the multi-speaker and unseen-speaker conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). Demo samples of the proposed and other models are available at https://sam-0927.github.io/
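To illustrate the subjective naturalness measure, a MOS is the arithmetic mean of listener ratings for a system. The minimal sketch below assumes a 1-to-5 rating scale and adds a normal-approximation 95% confidence interval, which is an assumption for illustration rather than the procedure used in this work. The function name `mean_opinion_score` and the example ratings are hypothetical and are not data from our experiments.

```python
import math
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings on a 1-5 scale; the interval uses a normal
    approximation (1.96 * sample std / sqrt(n)), an illustrative choice.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for one system (illustrative values only).
example_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two systems exceeds listener variability.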