Automatic Voice Over (AVO) aims to generate speech that is synchronized with a silent video, given the video's text script. Recent AVO frameworks built on text-to-speech (TTS) synthesis have shown impressive results. However, the current AVO learning objective, acoustic feature reconstruction, supervises inter-modal alignment learning only indirectly, which limits both synchronization performance and synthetic speech quality. To address this, we propose a novel AVO method whose learning objective is the prediction of self-supervised discrete speech units. This objective not only provides more direct supervision for alignment learning, but also alleviates the mismatch between the text-video context and the acoustic features. Experimental results show that the proposed method achieves remarkable lip-speech synchronization and high speech quality, outperforming baselines in both objective and subjective evaluations. Code and speech samples are publicly available.
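To make the contrast between the two learning objectives concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the discrete speech units are integer indices into a fixed vocabulary and that unit prediction is trained with a frame-wise cross-entropy, while acoustic feature reconstruction is represented by a mean absolute error over mel-spectrogram frames. The abstract states neither detail, and the names `unit_prediction_loss` and `mel_reconstruction_loss` are hypothetical.

```python
import math
from typing import Sequence


def unit_prediction_loss(logits: Sequence[Sequence[float]],
                         target_units: Sequence[int]) -> float:
    """Mean cross-entropy between per-frame logits and target unit indices.

    Assumption: one discrete unit index per output frame.
    """
    if len(logits) != len(target_units):
        raise ValueError("logits and target_units must cover the same frames")
    if not logits:
        raise ValueError("at least one frame is required")
    total = 0.0
    for frame_logits, unit in zip(logits, target_units):
        if not 0 <= unit < len(frame_logits):
            raise ValueError("target unit index is outside the vocabulary")
        # Numerically stable log-sum-exp over the unit vocabulary.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        # Negative log-probability of the correct unit for this frame.
        total += log_norm - frame_logits[unit]
    return total / len(logits)


def mel_reconstruction_loss(pred: Sequence[Sequence[float]],
                            target: Sequence[Sequence[float]]) -> float:
    """Mean absolute error between predicted and reference mel frames.

    Assumption: an L1 loss stands in for acoustic feature reconstruction.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of mel bins")
        total += sum(abs(p - t) for p, t in zip(pred_frame, target_frame))
        count += len(pred_frame)
    return total / count
```

Under these assumptions, the unit loss penalizes every frame for predicting the wrong unit at that time step, which is one way to read the claim of more direct supervision for alignment. The reconstruction loss instead averages continuous errors over all mel bins.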