Recent research has demonstrated impressive results in video-to-speech synthesis, which reconstructs speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To address this issue, they have adopted an additional speaker embedding, extracted from reference audio, as speaking-style guidance. Nevertheless, audio corresponding to the input video is not always available, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor built on a self-supervised pre-trained model and a prompt tuning technique. With it, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is needed at inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and on the visual representation extracted from the input video. The proposed DiffV2S not only maintains the phoneme details contained in the input video frames, but also generates a highly intelligible mel-spectrogram in which the identities of multiple speakers are all preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.