The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge for lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to further mitigate this problem, we employ a flow-based post-net that captures and refines the details of the generated speech. We perform extensive experiments and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in speech naturalness and intelligibility by a large margin. Synthesised samples are available at the anonymous demo page: https://mm.kaist.ac.kr/projects/LTBS.