Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, which is valuable when audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework designed for prosodic consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from facial video. Experimental results demonstrate that, compared with prior approaches, LipSody substantially improves prosody-related metrics, reducing global and local pitch deviations while increasing energy consistency and speaker similarity.
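To make the three-cue prosody guidance concrete, the following is a minimal sketch, assuming each cue is already available as a fixed-size embedding from its own encoder (speaker identity from facial images, linguistic content from lip movements, emotional context from facial video). The fusion shown here, concatenation followed by a small MLP producing one conditioning vector for the diffusion decoder, is an illustrative assumption; the abstract does not specify LipSody's actual fusion architecture, and all module names and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class ProsodyGuidance(nn.Module):
    """Illustrative fusion of three prosody cues into one conditioning vector.

    Hypothetical sketch: the paper's abstract names the three cues but not
    this architecture. Dimensions are placeholders.
    """

    def __init__(self, spk_dim=256, content_dim=512, emo_dim=128, out_dim=512):
        super().__init__()
        # Project the concatenated cues to a single guidance embedding that
        # a diffusion-based decoder could condition on at each denoising step.
        self.proj = nn.Sequential(
            nn.Linear(spk_dim + content_dim + emo_dim, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, spk_emb, content_emb, emo_emb):
        # spk_emb:     (B, spk_dim)     speaker identity from facial images
        # content_emb: (B, content_dim) linguistic content from lip movements
        # emo_emb:     (B, emo_dim)     emotional context from facial video
        fused = torch.cat([spk_emb, content_emb, emo_emb], dim=-1)
        return self.proj(fused)

# Toy usage with random tensors standing in for real encoder outputs.
guidance = ProsodyGuidance()
cond = guidance(torch.randn(2, 256), torch.randn(2, 512), torch.randn(2, 128))
print(cond.shape)  # torch.Size([2, 512])
```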