The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow-based post-net that captures and refines the details of the generated speech. We perform extensive experiments on two datasets and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS.