Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that directly predicting mel-spectrograms hampers training and model efficiency because speech content is entangled with ambient information and speaker characteristics. To address this, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a disentangled representation of speech content. A vocoder then converts these speech features into raw waveforms. Extensive evaluations confirm the effectiveness of this setup, which achieves state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
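The two-stage data flow described above can be illustrated with a minimal sketch. It shows only how the stages compose and how sequence lengths change. Every name in it (`LipToSpeechPipeline`, `toy_content_predictor`, `toy_vocoder`, `upsample`, `hop`) is hypothetical, and the toy functions are arbitrary stand-ins rather than the models used in RobustL2S.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # one feature vector per time step


@dataclass
class LipToSpeechPipeline:
    """Hypothetical wrapper: stage 1 predicts content features, stage 2 vocodes them."""
    content_predictor: Callable[[List[Frame]], List[Frame]]
    vocoder: Callable[[List[Frame]], List[float]]

    def synthesize(self, visual_features: List[Frame]) -> List[float]:
        content = self.content_predictor(visual_features)  # stage 1
        return self.vocoder(content)                        # stage 2


def toy_content_predictor(visual_features: List[Frame], upsample: int = 2) -> List[Frame]:
    """Stand-in for the sequence-to-sequence model (assumption, not the real model).

    Each output frame is computed from the input alone, never from earlier
    outputs, which is what makes the mapping non-autoregressive. The factor
    `upsample` is an arbitrary choice mimicking a higher speech-feature rate.
    """
    out: List[Frame] = []
    for frame in visual_features:
        mean = sum(frame) / len(frame)
        out.extend([mean, -mean] for _ in range(upsample))
    return out


def toy_vocoder(content_features: List[Frame], hop: int = 4) -> List[float]:
    """Stand-in for the vocoder: emits `hop` waveform samples per content frame."""
    waveform: List[float] = []
    for frame in content_features:
        waveform.extend([frame[0]] * hop)
    return waveform


pipeline = LipToSpeechPipeline(toy_content_predictor, toy_vocoder)
visual = [[1.0, 3.0], [0.0, 2.0], [2.0, 2.0]]  # 3 video frames, 2-dim features
waveform = pipeline.synthesize(visual)
print(len(waveform))  # 3 frames * 2 (upsample) * 4 (hop) samples
```

Because the stages meet only at the content-feature interface, either one could be replaced or retrained separately in such a design, which is the practical sense of "modularized" here.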