Video-to-speech synthesis reconstructs a speaker's speech signal from a silent video. The task implicitly assumes that the audio signal is either missing or so heavily corrupted by noise that it is not useful for processing. Previous works either use video inputs only, or employ both video and audio inputs during training and discard the audio input pathway at inference. In this work, we investigate the effect of using both video and audio inputs for video-to-speech synthesis during training and inference alike. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals, and then train an audio-visual-to-speech synthesis model that takes both the silent video and the synthesized speech as inputs to predict the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
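The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustration of how the inputs are routed, not the authors' implementation: the function names, the `SAMPLES_PER_FRAME` constant, and the toy arithmetic standing in for the neural networks are all assumptions introduced here.

```python
from typing import Callable, List, Sequence

Frame = List[float]     # one flattened video frame (toy stand-in)
Waveform = List[float]  # mono audio samples

SAMPLES_PER_FRAME = 4   # assumed number of audio samples per video frame (toy value)


def pretrained_video_to_speech(video: Sequence[Frame]) -> Waveform:
    """Hypothetical stand-in for a frozen, pre-trained video-to-speech model.

    A real model would be a neural network; here each frame is mapped to a
    constant chunk of audio whose level is the mean pixel value.
    """
    audio: Waveform = []
    for frame in video:
        level = sum(frame) / len(frame)
        audio.extend([level] * SAMPLES_PER_FRAME)
    return audio


def audio_visual_to_speech(video: Sequence[Frame], synthesized: Waveform) -> Waveform:
    """Hypothetical stand-in for the second-stage audio-visual-to-speech model.

    It consumes BOTH the silent video and the stage-1 synthesized speech.
    The 50/50 mix below is a placeholder for a learned fusion.
    """
    output: Waveform = []
    for i, frame in enumerate(video):
        visual = sum(frame) / len(frame)
        chunk = synthesized[i * SAMPLES_PER_FRAME:(i + 1) * SAMPLES_PER_FRAME]
        output.extend(0.5 * visual + 0.5 * sample for sample in chunk)
    return output


def reconstruct_speech(
    video: Sequence[Frame],
    first_stage: Callable[[Sequence[Frame]], Waveform] = pretrained_video_to_speech,
    second_stage: Callable[[Sequence[Frame], Waveform], Waveform] = audio_visual_to_speech,
) -> Waveform:
    """Run the same two-stage path that is used in both training and inference."""
    synthesized = first_stage(video)         # stage 1: synthesize the missing audio
    return second_stage(video, synthesized)  # stage 2: fuse video and synthesized audio


if __name__ == "__main__":
    silent_video = [[0.0, 1.0], [1.0, 1.0]]  # two toy frames
    print(reconstruct_speech(silent_video))
```

The point of the sketch is that the audio pathway is never discarded: because the first stage produces its audio from the video alone, the second stage receives the same pair of inputs at inference as it does during training.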