Given a piece of text, a video clip, and a reference audio sample, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion as presented in the video, using the desired speaker's voice as reference. V2C is more challenging than conventional text-to-speech because it additionally requires the generated speech to match the varying emotions and speaking speed shown in the video. Unlike previous work, we propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, which bridges visual information to the corresponding speech prosody at three levels: lip, face, and scene. Specifically, we align lip movement with speech duration, and we convey facial expression to speech energy and pitch through an attention mechanism based on valence and arousal representations, inspired by recent findings in psychology. Moreover, we design an emotion booster to capture the atmosphere of the global video scene. All these embeddings are used together to generate a mel-spectrogram, which is then converted to a speech waveform by an existing vocoder. Extensive experiments on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method. The source code and trained models will be released to the public.
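The duration constraint mentioned above (speech length must follow the video) can be illustrated with a minimal sketch. This is not the proposed architecture, which aligns lip movement with duration inside the model; it only shows the bookkeeping the constraint implies: a fixed budget of mel-spectrogram frames, set by the clip length, is divided among phonemes. The function name `allocate_durations`, the per-phoneme `weights`, and the ratio `mel_per_video_frame` are all hypothetical and introduced here for illustration.

```python
import math


def allocate_durations(weights, n_video_frames, mel_per_video_frame=4):
    """Split a fixed mel-frame budget across phonemes in proportion to weights.

    Assumptions (not from the paper): `weights` are non-negative relative
    phoneme lengths, and each video frame spans `mel_per_video_frame`
    mel-spectrogram frames.
    """
    if not weights or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")

    # Total number of mel frames is fixed by the video length.
    total = n_video_frames * mel_per_video_frame
    scale = total / sum(weights)
    raw = [w * scale for w in weights]

    # Round down first, then hand out the remaining frames to the phonemes
    # with the largest fractional parts (largest-remainder rounding).
    durations = [math.floor(r) for r in raw]
    leftover = total - sum(durations)
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - durations[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        durations[i] += 1
    return durations
```

The useful property is that the integer durations always sum to exactly the frame budget, so the synthesized mel-spectrogram has the same length as the video regardless of how the weights are rounded.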