Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level, which learns pronunciation style from the reference audio and generates intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two of the primary benchmarks, V2C and Grid, demonstrate the favorable performance of the proposed method compared with the current state-of-the-art. The source code and trained models will be released to the public.
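To make the frame-level versus phoneme-level distinction concrete, the following minimal sketch shows one generic way to keep every phoneme whole while still matching the length of the video. It is not StyleDubber's lip aligner. The function names `allocate_mel_frames` and `expand`, the relative phoneme weights, and the ratio of 4 mel frames per video frame are all assumptions made for illustration. The idea is to fix the total mel-frame budget from the video length and then split that budget across phonemes, rather than cutting phonemes at video-frame boundaries.

```python
def allocate_mel_frames(weights, n_video_frames, mel_per_video_frame=4):
    """Split a fixed mel-frame budget across phonemes (hypothetical helper).

    weights: relative duration of each phoneme (assumed given, e.g. by a predictor).
    mel_per_video_frame: assumed ratio of mel frames to video frames.
    Returns one integer frame count per phoneme, summing exactly to the budget.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and positive")
    total = n_video_frames * mel_per_video_frame
    if total < len(weights):
        raise ValueError("budget too small to give every phoneme a frame")

    # Reserve one frame per phoneme so that no phoneme is dropped,
    # then share the remaining frames in proportion to the weights.
    spare = total - len(weights)
    scale = spare / sum(weights)
    exact = [w * scale for w in weights]
    counts = [1 + int(e) for e in exact]

    # Largest-remainder rounding: hand out the frames lost to truncation.
    leftover = total - sum(counts)
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - int(exact[i]),
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def expand(phonemes, counts):
    """Repeat each phoneme-level item by its frame count (length regulation)."""
    return [p for p, c in zip(phonemes, counts) for _ in range(c)]


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]   # illustrative phoneme sequence
    weights = [1.0, 2.0, 1.0, 3.0]       # illustrative relative durations
    counts = allocate_mel_frames(weights, n_video_frames=5)
    print(counts, len(expand(phonemes, counts)))
```

Because the budget is derived from the number of video frames, the expanded sequence always spans the clip exactly, while each phoneme remains a single contiguous unit with at least one frame.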