In this paper, we introduce a novel approach to synthesizing speech from silent videos of any in-the-wild speaker, based solely on lip movements. Traditional approaches that generate speech directly from lip videos struggle to learn a robust language model from speech alone, which leads to unsatisfactory results. To overcome this issue, we incorporate noisy text supervision that instills language information into our model. The noisy text is generated by a pre-trained, state-of-the-art lip-to-text model, so our approach requires no text annotations during inference. We design a visual text-to-speech network that uses the visual stream to generate accurate speech that is in sync with the silent input video. We perform extensive experiments and ablation studies, which demonstrate that our approach outperforms current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost their voice but can still make mouth movements. Our demo video, code, and additional details can be found at \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw}.