Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the quality of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker similarity for the multi-speaker condition and gender similarity for the unseen condition. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). The demo samples of the proposed and other models are available at https://sam-0927.github.io/
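
As a rough illustration of the MOS test mentioned above: a mean opinion score is conventionally the arithmetic mean of listener ratings on a 1-to-5 scale, often reported with a confidence interval. The sketch below assumes a normal-approximation 95% interval (z = 1.96). The function name `mean_opinion_score` and the ratings are hypothetical and are not taken from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for listener ratings on a 1-5 scale.

    half_width is a normal-approximation confidence half-width;
    z = 1.96 corresponds to roughly 95% (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for illustration only (not data from the paper).
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few listeners, a Student-t interval would be the more careful choice; the normal approximation is used here only to keep the sketch short.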

