Recent studies have shown impressive performance in lip-to-speech (Lip2Speech) synthesis, which aims to reconstruct speech from visual information alone. However, they struggle to synthesize accurate speech in the wild because the supervision available to guide the model toward the correct content is insufficient. In contrast to previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in in-the-wild environments. To this end, we design a multi-task learning scheme that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations provided by the acoustic feature reconstruction loss. As a result, the proposed framework can synthesize speech with the right content for multiple speakers and unconstrained sentences. We verify the effectiveness of the proposed method on the LRS2, LRS3, and LRW datasets.
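The multi-task objective described above can be pictured as a weighted sum of an acoustic reconstruction term and two auxiliary terms, one supervised by text and one by audio. The following is a minimal sketch under stated assumptions, not the method's actual implementation: the abstract does not specify the form of either auxiliary loss, so a per-token cross-entropy stands in for the text term and an L1 distance between audio features stands in for the audio term. All function names and the weights `w_text` and `w_audio` are hypothetical.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length feature sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def token_cross_entropy(
    token_probs: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Average negative log-likelihood of the reference tokens.

    Assumption: a simple stand-in for text supervision; the source
    does not state which text loss is used.
    """
    if len(token_probs) != len(token_ids) or not token_ids:
        raise ValueError("need one probability vector per reference token")
    nll = [-math.log(probs[idx]) for probs, idx in zip(token_probs, token_ids)]
    return sum(nll) / len(nll)


def multitask_loss(
    pred_mel: Sequence[float],
    true_mel: Sequence[float],
    token_probs: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    pred_audio_feat: Sequence[float],
    true_audio_feat: Sequence[float],
    w_text: float = 1.0,   # hypothetical weight for the text term
    w_audio: float = 1.0,  # hypothetical weight for the audio term
) -> float:
    """Reconstruction loss plus weighted text and audio supervision terms."""
    recon = l1_loss(pred_mel, true_mel)
    text = token_cross_entropy(token_probs, token_ids)
    audio = l1_loss(pred_audio_feat, true_audio_feat)
    return recon + w_text * text + w_audio * audio


# Toy usage with flattened features and a two-token vocabulary.
example = multitask_loss(
    pred_mel=[0.5, 1.0],
    true_mel=[0.0, 1.0],
    token_probs=[[0.5, 0.5]],
    token_ids=[0],
    pred_audio_feat=[1.0],
    true_audio_feat=[1.0],
)
```

The point of the sketch is the structure of the objective: the reconstruction term alone measures only acoustic distance, while the two auxiliary terms add a training signal tied to what is being said, which is the gap the abstract attributes to reconstruction-only training.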